Introduction to Stream Processing Guido Schmutz Frankfurt - - - PowerPoint PPT Presentation



slide-1
SLIDE 1

@gschmutz

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH

Introduction to Stream Processing

Guido Schmutz Frankfurt - 21.2.2019

@gschmutz guidoschmutz.wordpress.com

slide-2
SLIDE 2

@gschmutz

Agenda

Introduction to Stream Processing

1. Motivation for Stream Processing?
2. Capabilities for Stream Processing
3. Implementing Stream Processing Solutions
4. Demo
5. Summary

slide-3
SLIDE 3

@gschmutz

Guido Schmutz

  • Working at Trivadis for more than 22 years
  • Oracle Groundbreaker Ambassador & Oracle ACE Director
  • Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
  • Head of Trivadis Architecture Board
  • Technology Manager @ Trivadis
  • More than 30 years of software development experience
  • Contact: guido.schmutz@trivadis.com
  • Blog: http://guidoschmutz.wordpress.com
  • Slideshare: http://www.slideshare.net/gschmutz
  • Twitter: gschmutz

Introduction to Stream Processing

slide-4
SLIDE 4

@gschmutz

Motivation for Stream Processing?

Introduction to Stream Processing

slide-5
SLIDE 5

@gschmutz

[Diagram: batch architecture. Components shown: Bulk Source (DB Extract, File, DB); File Import / SQL Import; Big Data Platform on a Hadoop cluster with Parallel Processing and Storage (Raw, Refined); Results; SQL, Search and Export services; BI Tools; Enterprise Data Warehouse; Search / Explore; Enterprise Apps (Logic, API). Annotation: high latency.]

Big Data solves Volume and Variety – not Velocity

Introduction to Stream Processing

slide-6
SLIDE 6

@gschmutz

[Diagram: the batch architecture of the previous slide, now with an Event Source added (Location, Telemetry, IoT Data, Mobile Apps, Social) that emits an Event Stream. Annotation: high latency.]

Big Data solves Volume and Variety – not Velocity

slide-7
SLIDE 7

@gschmutz

[Diagram: the same batch architecture with Event Hubs added for the Event Stream from the Event Source (Location, Telemetry, IoT Data, Mobile Apps, Social), and with Machine Learning, Graph Algorithms and Natural Language Processing added to the Big Data Platform. Annotation: high latency.]

Big Data solves Volume and Variety – not Velocity

slide-8
SLIDE 8

@gschmutz

"Data at Rest" vs. "Data in Motion"

[Diagram: the steps Store, Analyze and Act, arranged once for Data at Rest and once for Data in Motion.]

slide-9
SLIDE 9

@gschmutz

When to Stream / When not?

  • Real-Time: constant low latency, milliseconds and under
  • Near-Real-Time: low milliseconds to seconds, delay in case of failures
  • Batch: tens of seconds or more, re-run in case of failures

Source: adapted from Cloudera

slide-10
SLIDE 10

@gschmutz

"No free lunch"

  • Real-Time: constant low latency, milliseconds and under
  • Near-Real-Time: low milliseconds to seconds, delay in case of failures
  • Batch: tens of seconds or more, re-run in case of failures

"Difficult" architectures give lower latency; "easier" architectures come with higher latency.

slide-11
SLIDE 11

@gschmutz

Stream Processing Architecture solves Velocity

[Diagram: Components shown: Bulk Source (DB Extract, File, DB) and Event Source (Location, Telemetry, IoT Data, Mobile Apps, Social); Event Streams; Event Hubs; Stream Analytics Platform (Stream Analytics, Reference / Models, Results); Dashboard; SQL and Search Service; BI Tools; Enterprise Data Warehouse; Search / Explore; Enterprise Apps (Logic, API).]

Low(est) latency, no history

slide-12
SLIDE 12

@gschmutz

Big Data for all historical data analysis

[Diagram: the Stream Analytics Platform of the previous slide combined with a Big Data Platform (Parallel Processing, Raw and Refined Storage, Results). Also shown: Event Hub, Event Streams, Data Flow, File Import / SQL Import, the bulk and event sources, and the same consumers.]

slide-13
SLIDE 13

@gschmutz

Integrate existing systems with lower latency through CDC

[Diagram: the combined Stream Analytics and Big Data architecture, with Change Data Capture added as a further path from the existing sources, next to File Import / SQL Import, Event Hub, Event Streams and Data Flow.]

slide-14
SLIDE 14

@gschmutz

New systems participate in event-oriented fashion

[Diagram: the same architecture extended with a Microservice Platform (Microservice with State and API) next to the Stream Analytics Platform (Stream Processor with State and API) and the Big Data Platform. Also shown: Event Hub, Event Streams, Data Flow, Change Data Capture, File Import / SQL Import, the bulk and event sources, and the consumers BI Tools, Enterprise Data Warehouse, Search / Explore and Enterprise Apps.]

slide-15
SLIDE 15

@gschmutz

Edge computing allows processing close to data sources

[Diagram: the same architecture with an Edge Node (Rules, Event Hub, Storage) added near the event sources. Also shown: the central Event Hub, Event Streams, Data Flows, Change Data Capture, File Import / SQL Import, the Big Data, Stream Analytics and Microservice Platforms, and the consumers.]

slide-16
SLIDE 16

@gschmutz

Unified Architecture for Modern Data Analytics Solutions

[Diagram: all building blocks of the preceding slides in one picture: Bulk Source and Event Source; Edge Node (Rules, Event Hub, Storage); File Import / SQL Import, Change Data Capture and Data Flows; the central Event Hub with Event Streams; Big Data (Parallel Processing, Raw and Refined Storage, Results); Stream Analytics (Stream Processor with State and API); Microservices (Microservice with State and API); SQL, Search and Export services; and the consumers BI Tools, Enterprise Data Warehouse, Search / Explore and Enterprise Apps.]

slide-17
SLIDE 17

@gschmutz

Two Types of Stream Processing (by Gartner)

Stream Data Integration

  • focuses on the ingestion and processing of data sources, targeting real-time extract-transform-load (ETL) and data integration use cases
  • filters and enriches the data

Stream Analytics

  • targets analytics use cases
  • calculates aggregates and detects patterns to generate higher-level, more relevant summary information (complex events)
  • complex events may signify threats or opportunities that require a response from the business

Gartner: Market Guide for Event Stream Processing, Nick Heudecker, W. Roy Schulte

slide-18
SLIDE 18

@gschmutz

Stream Processing & Analytics Ecosystem

[Diagram: product landscape grouped into Stream Analytics, Event Hub, Stream Data Integration and Edge, each split into Open Source and Closed Source.]

Source: adapted from Tibco

slide-19
SLIDE 19

@gschmutz

Stream vs. Table / Static

Stream ("History")

  • an unbounded sequence of structured data ("facts")
  • facts in a stream are immutable

Table / Static ("State")

  • a view of a stream, or of another table, that represents a collection of evolving facts
  • holds the latest value for each key in a stream
  • facts in a table are mutable
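The relation between the two can be sketched in a few lines of plain Java. This is only an illustration under assumptions: the class `StreamTableDuality`, the `Fact` type and the truck keys are hypothetical names, and no streaming library is involved. The stream is an append-only list of facts; the table is what remains when only the latest value per key is kept.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not tied to any streaming product.
class StreamTableDuality {

    // One immutable fact in the stream: "this key had this value".
    static final class Fact {
        final String key;
        final int value;

        Fact(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Folds the stream into a table that keeps the latest value per key.
    static Map<String, Integer> materialize(List<Fact> stream) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Fact fact : stream) {
            table.put(fact.key, fact.value); // a newer fact overwrites the old state
        }
        return table;
    }

    public static void main(String[] args) {
        List<Fact> stream = new ArrayList<>();
        stream.add(new Fact("truck-1", 50));
        stream.add(new Fact("truck-2", 80));
        stream.add(new Fact("truck-1", 65)); // appended; the earlier fact stays untouched

        Map<String, Integer> table = materialize(stream);
        System.out.println("facts in stream: " + stream.size());
        System.out.println("keys in table:   " + table.size());
        System.out.println("truck-1 state:   " + table.get("truck-1"));
    }
}
```

The stream keeps all three facts (history), while the table only holds the current value for each of the two keys (state).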

slide-20
SLIDE 20

@gschmutz Introduction to Stream Processing

Important Capabilities for Stream Processing

slide-21
SLIDE 21

@gschmutz

Capabilities: Stream Data Integration vs. Stream Analytics

Stream Data Integration  Stream Analytics  Capability
yes                      partial           Support for Various Data Sources
yes                      partial           Streaming ETL (Transformation/Format Translation, Routing, Validation)
yes                      partial           Micro-Batching
yes                      yes               Event-at-a-time
yes                      yes               Delivery Guarantees
yes                      yes               API: GUI-Based / Declarative / Programmatic
-                        yes               API: Streaming SQL
-                        yes               Event Time vs. Ingestion / Processing Time
-                        yes               Windowing
partial                  yes               Stream-to-Static Joins (Lookup/Enrichment)
-                        yes               Stream-to-Stream Joins
-                        yes               State Management
-                        yes               Queryable State (aka Interactive Queries)
-                        yes               Event Pattern Detection

slide-22
SLIDE 22

@gschmutz

Integrating Data Sources

Introduction to Stream Processing

  • SQL Polling
  • Change Data Capture (CDC)
  • File Polling
  • File Stream (File Tailing)
  • File Stream (Appender)
  • Sensor Stream

slide-23
SLIDE 23

@gschmutz

Streaming ETL

Introduction to Stream Processing

  • Flow-based "programming"
  • Ingest data from various sources
  • Extract – Transform – Load
  • High-throughput, straight-through data flows
  • Data lineage
  • Batch or stream processing
  • Visual coding with flow editor
  • Event Stream Processing (ESP) but not Complex Event Processing (CEP)

Source: Confluent

slide-24
SLIDE 24

@gschmutz

Event-at-a-time vs. Micro Batch

Introduction to Stream Processing

Event-at-a-time Processing

  • Events are processed as they arrive
  • Low latency
  • Fault tolerance is expensive

Micro-Batch Processing

  • Splits the incoming stream into small batches
  • Fault tolerance is easier
  • Better throughput
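The difference between the two modes can be sketched as follows. This is a toy model under assumptions: `ProcessingModes`, the doubling "handler" and the batch size are made up for illustration, and the handler invocation count stands in for per-call overhead such as a commit or checkpoint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the two processing modes.
class ProcessingModes {

    // Event-at-a-time: the handler runs once per event, as soon as it arrives.
    static int eventAtATime(List<Integer> events, List<Integer> out) {
        int invocations = 0;
        for (int event : events) {
            out.add(event * 2);
            invocations++;
        }
        return invocations;
    }

    // Micro-batch: events are grouped and the handler runs once per batch.
    static int microBatch(List<Integer> events, int batchSize, List<Integer> out) {
        int invocations = 0;
        for (int start = 0; start < events.size(); start += batchSize) {
            int end = Math.min(start + batchSize, events.size());
            for (int event : events.subList(start, end)) {
                out.add(event * 2);
            }
            invocations++; // one unit of work (and of retry) per batch
        }
        return invocations;
    }

    public static void main(String[] args) {
        List<Integer> events = Arrays.asList(1, 2, 3, 4, 5);
        List<Integer> single = new ArrayList<>();
        List<Integer> batched = new ArrayList<>();
        System.out.println("event-at-a-time invocations: " + eventAtATime(events, single));
        System.out.println("micro-batch invocations:     " + microBatch(events, 2, batched));
        System.out.println("same results: " + single.equals(batched));
    }
}
```

Both modes produce the same results; the micro-batch variant needs fewer handler invocations, but the first event of each batch has to wait until the batch is formed.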
slide-25
SLIDE 25

@gschmutz

Delivery Guarantees

Introduction to Stream Processing

At most once (fire-and-forget) [ 0 | 1 ]

The message is sent, but the sender doesn't care whether it is received or lost. It is the easiest and most performant behavior to support.

At least once [ 1+ ]

Retransmission of a message will occur until an acknowledgment is received. Since a delayed acknowledgment from a receiver could be in flight when the sender retransmits, the message may be received one or more times.

Exactly once [ 1 ]

Ensures that a message is received once and only once, and is never lost and never repeated. The system must implement whatever mechanisms are required to ensure that a message is received and processed just once.
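One commonly used mechanism for getting from at-least-once delivery to exactly-once processing is an idempotent receiver that remembers message ids. The sketch below assumes this technique; the slide itself does not prescribe it. `DeliveryGuarantees`, the message ids and the payloads are hypothetical, and each delivery is modelled as a two-element array of id and payload.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: each delivery is {messageId, payload}.
class DeliveryGuarantees {

    // Naive receiver: applies every delivery, so a retransmission is processed twice.
    static List<String> naiveReceive(List<String[]> deliveries) {
        List<String> processed = new ArrayList<>();
        for (String[] delivery : deliveries) {
            processed.add(delivery[1]);
        }
        return processed;
    }

    // Idempotent receiver: remembers message ids and skips ids it has already seen.
    static List<String> idempotentReceive(List<String[]> deliveries) {
        Set<String> seenIds = new HashSet<>();
        List<String> processed = new ArrayList<>();
        for (String[] delivery : deliveries) {
            if (seenIds.add(delivery[0])) { // add() returns false for a duplicate id
                processed.add(delivery[1]);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String[]> deliveries = new ArrayList<>();
        deliveries.add(new String[] {"m1", "debit 10"});
        deliveries.add(new String[] {"m2", "debit 20"});
        deliveries.add(new String[] {"m2", "debit 20"}); // retransmitted after a lost ack

        System.out.println(naiveReceive(deliveries));      // the debit of 20 is applied twice
        System.out.println(idempotentReceive(deliveries)); // each message id is applied once
    }
}
```

In a real system the set of seen ids would itself have to survive failures, which is part of why exactly-once is the most expensive guarantee.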

slide-26
SLIDE 26

@gschmutz

API

Introduction to Stream Processing

GUI-based / Drag-and-Drop

  • A graphical way of designing a pipeline
  • Often web-based to improve usability

Declarative

  • A streaming engine which can be configured in a declarative way
  • JSON, YAML

    "config": {
      "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
      "tasks.max": "1",
      "mqtt.server.uri": "tcp://mosquitto-1:1883",
      "mqtt.topics": "truck/+/position",
      "kafka.topic": "truck_position",
      ...

Programmatic

  • Low-level (class) or high-level fluent API
  • Higher-order functions as operators (filter, mapWithState, …)

    val filteredDf = truckPosDf.where("eventType != 'Normal'")

Streaming SQL

  • Use a stream in a FROM clause
  • Extensions supporting windowing, pattern matching, spatial, … operators

    SELECT * FROM truck_position_s WHERE eventType != 'Normal'

slide-27
SLIDE 27

@gschmutz

Event Time vs. Ingestion / Processing Time

Introduction to Stream Processing

Event time

the time at which events actually occurred

Ingestion time / Processing Time

the time at which events are ingested into / processed by the system

Not all use cases care about event time, but many do. Examples:

  • characterizing user behavior over time
  • most billing applications
  • anomaly detection
slide-28
SLIDE 28

@gschmutz

Windowing

Introduction to Stream Processing

  • Computations over events are done using windows of data
  • Because a stream is large and never-ending, it is not feasible to keep the entire stream of data in memory
  • A window of data represents a certain amount of data on which we can perform computations
  • Windows give the power to keep a working memory and look back at recent data efficiently

[Diagram: a Window of Data selected from a Stream of Data along a time axis]

slide-29
SLIDE 29

@gschmutz

Windowing

  • Sliding Window (aka Hopping Window): uses eviction and trigger policies that are based on time: window length and sliding interval length
  • Fixed Window (aka Tumbling Window): the eviction policy is always based on the window being full; the trigger policy is based on either the count of items in the window or time
  • Session Window: composed of sequences of temporally related events, terminated by a gap of inactivity greater than some timeout

[Diagram: the three window types drawn along a time axis]
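How events end up in these window types can be sketched with plain arithmetic on event timestamps. This is a simplified model under assumptions: `WindowAssignment` and its method names are hypothetical, timestamps are non-negative numbers in an arbitrary unit (read them as seconds), and triggers and eviction are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of window assignment by event timestamp.
class WindowAssignment {

    // Tumbling (fixed) window: each timestamp falls into exactly one window.
    static long tumblingStart(long ts, long size) {
        return ts - (ts % size);
    }

    // Hopping (sliding) window: windows of length `size` start every `hop`;
    // a timestamp belongs to every window [start, start + size) that covers it.
    static List<Long> hoppingStarts(long ts, long size, long hop) {
        List<Long> starts = new ArrayList<>();
        long latest = ts - (ts % hop); // latest window start that is <= ts
        for (long start = latest; start > ts - size && start >= 0; start -= hop) {
            starts.add(0, start); // prepend to keep ascending order
        }
        return starts;
    }

    // Session windows: sorted timestamps are split wherever the gap exceeds `timeout`.
    static List<List<Long>> sessions(List<Long> sortedTs, long timeout) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTs) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > timeout) {
                result.add(current); // gap of inactivity closes the session
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tumblingStart(70, 60));     // start of the single 60-unit window
        System.out.println(hoppingStarts(70, 60, 30)); // starts of all overlapping windows
        System.out.println(sessions(Arrays.asList(1L, 2L, 10L, 11L), 5));
    }
}
```

With a window length of 60 and a hop of 30, every event is counted in two overlapping windows, which is why hopping windows produce smoother but redundant aggregates compared to tumbling windows.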

slide-30
SLIDE 30

@gschmutz

Joining – Stream-to-Static and Stream-to-Stream

Introduction to Stream Processing

Challenges of joining streams

1. Data streams need to be aligned as they arrive, because they have different timestamps
2. Since streams are never-ending, the joins must be limited; otherwise the join will never end
3. The join needs to produce results continuously, as there is no end to the data

Join types

  • Stream-to-Static (Table) Join
  • Stream-to-Stream Join (one window join)
  • Stream-to-Stream Join (two window join)

[Diagram: each join type drawn along a time axis]
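The two basic join shapes can be sketched as follows. This is a minimal model under assumptions: `StreamJoins`, `Event` and the sample data are hypothetical, and the stream-to-stream join runs over finished lists for readability, whereas a real engine would buffer only the events that still fall inside the window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a lookup join and a windowed stream-to-stream join.
class StreamJoins {

    static final class Event {
        final String key;
        final long ts;
        final String payload;

        Event(String key, long ts, String payload) {
            this.key = key;
            this.ts = ts;
            this.payload = payload;
        }
    }

    // Stream-to-static join: enrich each event by looking up its key in a table.
    static List<String> joinWithTable(List<Event> stream, Map<String, String> table) {
        List<String> out = new ArrayList<>();
        for (Event event : stream) {
            String reference = table.get(event.key);
            if (reference != null) {
                out.add(event.payload + "|" + reference);
            }
        }
        return out;
    }

    // Stream-to-stream join: pair events with equal keys whose timestamps are
    // at most `window` apart. The window is what keeps the join bounded.
    static List<String> joinWithinWindow(List<Event> left, List<Event> right, long window) {
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                if (l.key.equals(r.key) && Math.abs(l.ts - r.ts) <= window) {
                    out.add(l.payload + "|" + r.payload);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> positions = Arrays.asList(new Event("truck-1", 10, "pos@10"));
        List<Event> alerts = Arrays.asList(
                new Event("truck-1", 12, "alert@12"),
                new Event("truck-1", 100, "alert@100"));
        Map<String, String> drivers = new HashMap<>();
        drivers.put("truck-1", "driver-42");

        System.out.println(joinWithTable(positions, drivers));
        System.out.println(joinWithinWindow(positions, alerts, 5)); // only alert@12 is close enough
    }
}
```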

slide-31
SLIDE 31

@gschmutz

State Management

Introduction to Stream Processing

  • Necessary if the stream processing use case depends on previously seen data or external data
  • Windowing, Joining and Pattern Detection use State Management behind the scenes
  • State Management services can be made available for custom state handling logic
  • State needs to be managed as close to the stream processor as possible
  • How does it handle failures, e.g. if a machine crashes and some or all of the state is lost?

Options for State Management

  • In-Memory
  • Replicated, Distributed Store
  • Local, Embedded Store

[Diagram: the options placed on an axis "Operational Complexity and Features" running from low to high]

slide-32
SLIDE 32

@gschmutz

Queryable State (aka. Interactive Queries)

Introduction to Stream Processing

  • Exposes the state managed by the Stream Analytics solution to the outside world
  • Allows an application to query the managed state, e.g. to visualize it
  • For some scenarios, Queryable State can eliminate the need for an external database to keep results

[Diagram: Stream Processing Cluster with Stream Analytics, Stream Processor, State, Query API, Reference Data and Model; consumers shown are Search / Explore, Online & Mobile Apps and Dashboard]

slide-33
SLIDE 33

@gschmutz

Event Pattern Detection

Introduction to Stream Processing

  • Streaming data often contains interesting patterns that only emerge as new streaming data arrives, e.g.
      • Absence Pattern: event A not followed by event B within a time window
      • Sequence Pattern: event A followed by event B followed by event C
      • Increasing Pattern: upward trend of the value of a certain attribute
      • Decreasing Pattern: downward trend of the value of a certain attribute
  • Pattern operators allow developers to define complex relationships between streaming events
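As an example, the absence pattern can be sketched like this. The code is a simplification under assumptions: `AbsencePattern`, `Event` and the event types "A" and "B" are placeholders, and the check runs over a finished, time-ordered list. A live engine would instead use timers and only raise the alert once the time window has actually elapsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the absence pattern "A not followed by B within a window".
class AbsencePattern {

    static final class Event {
        final String type;
        final long ts;

        Event(String type, long ts) {
            this.type = type;
            this.ts = ts;
        }
    }

    // Returns the timestamp of every A that has no later B within `within` time units.
    static List<Long> unansweredA(List<Event> sortedEvents, long within) {
        List<Long> alerts = new ArrayList<>();
        for (int i = 0; i < sortedEvents.size(); i++) {
            Event a = sortedEvents.get(i);
            if (!a.type.equals("A")) {
                continue;
            }
            boolean answered = false;
            for (int j = i + 1; j < sortedEvents.size(); j++) {
                Event next = sortedEvents.get(j);
                if (next.ts - a.ts > within) {
                    break; // the time window for this A has closed
                }
                if (next.type.equals("B")) {
                    answered = true;
                    break;
                }
            }
            if (!answered) {
                alerts.add(a.ts);
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("A", 0), new Event("B", 3),   // answered within 5
                new Event("A", 10), new Event("C", 12), // no B follows in time
                new Event("B", 30));
        System.out.println(unansweredA(events, 5));
    }
}
```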

slide-34
SLIDE 34

@gschmutz

Capabilities: Stream Data Integration vs. Stream Analytics

Stream Data Integration  Stream Analytics  Capability
yes                      -                 Support for Various Data Sources
yes                      partial           Streaming ETL (Transformation/Format Translation, Routing, Validation)
yes                      partial           Micro-Batching
yes                      yes               Event-at-a-time
yes                      yes               Delivery Guarantees
yes                      yes               API: GUI-Based / Declarative / Programmatic
-                        yes               API: Streaming SQL
-                        yes               Event Time vs. Ingestion / Processing Time
-                        yes               Windowing
partial                  yes               Stream-to-Static Joins (Lookup/Enrichment)
-                        yes               Stream-to-Stream Joins
-                        yes               State Management
-                        yes               Queryable State (aka Interactive Queries)
-                        yes               Event Pattern Detection

slide-35
SLIDE 35

@gschmutz

Implementing Stream Processing Solutions

Introduction to Stream Processing

slide-36
SLIDE 36

@gschmutz

Stream Processing & Analytics Ecosystem

[Diagram: product landscape grouped into Stream Analytics, Event Hub, Stream Data Integration and Edge, each split into Open Source and Closed Source.]

Source: adapted from Tibco

slide-37
SLIDE 37

@gschmutz

Event Hub: Apache Kafka

  • Highly available Pub/Sub infrastructure
  • Highly scalable
  • Distributed log at the core

Logs do not (necessarily) forget. Data is discarded:

  • Never
  • Time (TTL) or size-based
  • Log-compaction based
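The difference between time-based retention and log compaction can be sketched on an in-memory list. This is a conceptual model only and does not reproduce how Kafka actually cleans its log; `LogRetention`, `LogRecord` and the sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of two retention strategies for a keyed log.
class LogRetention {

    static final class LogRecord {
        final String key;
        final String value;
        final long ts;

        LogRecord(String key, String value, long ts) {
            this.key = key;
            this.value = value;
            this.ts = ts;
        }
    }

    // Time-based retention: keep only records that are at most `ttl` old at time `now`.
    static List<LogRecord> retainByTime(List<LogRecord> log, long now, long ttl) {
        List<LogRecord> kept = new ArrayList<>();
        for (LogRecord record : log) {
            if (now - record.ts <= ttl) {
                kept.add(record);
            }
        }
        return kept;
    }

    // Compaction: keep only the most recent record per key, in log order.
    static List<LogRecord> compact(List<LogRecord> log) {
        Map<String, LogRecord> latest = new LinkedHashMap<>();
        for (LogRecord record : log) {
            latest.remove(record.key); // re-insert so the survivor keeps its later position
            latest.put(record.key, record);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<LogRecord> log = Arrays.asList(
                new LogRecord("truck-1", "Normal", 1),
                new LogRecord("truck-2", "Normal", 2),
                new LogRecord("truck-1", "Overspeed", 3));

        System.out.println("after TTL of 8 at time 10: " + retainByTime(log, 10, 8).size() + " records");
        for (LogRecord record : compact(log)) {
            System.out.println(record.key + " -> " + record.value);
        }
    }
}
```

Time-based retention forgets old records regardless of key, while compaction keeps the last known value for every key, which is what makes a compacted log usable as the backing store for a table.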
slide-38
SLIDE 38

@gschmutz

Stream Data Integration: Kafka Connect

Introduction to Stream Processing

  • Declarative-style data flows
  • Simplicity: "simple things done simple"
  • Very well integrated with Kafka: the framework is part of Kafka
  • Single Message Transforms (SMT)
  • Many connectors available
  • Implement custom connectors using Java
  • Supported by Confluent

    #!/bin/bash
    # Register the MQTT source connector through the Kafka Connect REST API
    curl -X "POST" "http://192.168.69.138:8083/connectors" \
         -H "Content-Type: application/json" \
         -d $'{
      "name": "mqtt-source",
      "config": {
        "connector.class": "...MqttSourceConnector",
        "tasks.max": "1",
        "name": "mqtt-source",
        "mqtt.server.uri": "tcp://mosquitto:1883",
        "mqtt.topics": "truck/+/position",
        "kafka.topic": "truck_position"
      }
    }'

slide-39
SLIDE 39

@gschmutz

Stream Data Integration: StreamSets

  • Continuous, open source, intent-driven big data ingest
  • Visible, record-oriented approach fixes combinatorial explosion
  • Both stream and batch processing: standalone, Spark cluster, MapReduce cluster
  • IDE for pipeline development by 'civilians'
  • Special option for edge computing
  • Custom sources, sinks, processors
  • Supported by StreamSets

slide-40
SLIDE 40

@gschmutz

Streaming Analytics: Kafka Streams

  • Designed as a simple and lightweight library in Apache Kafka
  • No other dependencies than Kafka
  • Supports fault-tolerant local state
  • Supports windowing (fixed, sliding and session) and stream-stream / stream-table joins
  • Millisecond processing latency, no micro-batching
  • At-least-once and exactly-once processing guarantees

    // Customers are read as a table (latest value per key), orders as a stream
    KTable<Integer, Customer> customers = builder.table("customer");
    KStream<Integer, Order> orders = builder.stream("order");
    // Enrich each order with the matching customer (joiner elided on the slide)
    KStream<Integer, String> enriched = orders.leftJoin(customers, …);
    enriched.to("orderEnriched");

[Diagram: a Java Application embedding Kafka Streams, connected to a Kafka Broker with the topic trucking_driver]

slide-41
SLIDE 41

@gschmutz

Streaming Analytics: KSQL

  • STREAM and TABLE as first-class citizens
      • STREAM = data in motion
      • TABLE = collected state of a stream
  • Stream processing with zero coding, using a SQL-like language
  • Built on top of Kafka Streams
  • Interactive (CLI) and headless (command file)

    ksql> CREATE STREAM order_s \
            WITH (kafka_topic='order', \
                  value_format='AVRO');

     Message
     Stream created

    ksql> SELECT * FROM order_s \
            WHERE address->country = 'Switzerland';
    ...

[Diagram: KSQL CLI sends commands to the KSQL Engine (built on Kafka Streams), which works against a Kafka Broker with the topic trucking_driver]

slide-42
SLIDE 42

@gschmutz

Streaming Analytics: Spark Structured Streaming

Introduction to Stream Processing

  • 2nd generation (the 1st being Spark Streaming)
  • Structured API through DataFrames / Datasets rather than RDDs
  • Easier code reuse between batch and streaming
  • Marked production-ready in Spark 2.2.0
  • Support for Java, Scala, Python, R and SQL

    val orderDf = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")
      .option("subscribe", "order")
      .load()

    // Filtering on address.country assumes the Kafka value has been parsed into columns
    val orderFilteredDf = orderDf.where("address.country = 'Switzerland'")

slide-43
SLIDE 43

@gschmutz

Demo

Introduction to Stream Processing

slide-44
SLIDE 44

@gschmutz

Sample Use Case

[Diagram: Truck-1, Truck-2 and Truck-3 publish to truck_position. Labels shown along the flow: detect_dangerous_driving; Truck Driver data via Change Data Capture; join_dangerous_driving_driver; dangerous_driving_driver; Count By Event Type with Window (1m, 30s); count_by_event_type.]

slide-45
SLIDE 45

@gschmutz

Summary

Introduction to Stream Processing

slide-46
SLIDE 46

@gschmutz

Summary

Introduction to Stream Processing

  • Stream Processing is the solution for low latency
  • Event Hub, Stream Data Integration and Stream Analytics are the main building blocks in your architecture
  • Kafka is currently the de facto standard for the Event Hub
  • Various options exist for Stream Data Integration and Stream Analytics
  • SQL becomes a valid option for implementing Stream Analytics
  • Still room for improvement (SQL, Event Pattern Detection, Streaming Machine Learning)

slide-47
SLIDE 47

@gschmutz Introduction to Stream Processing

Technology on its own won't help you.

You need to know how to use it properly.