Patterns Of Streaming Applications (POSA) — Monal Daxini, 11/6/2018



SLIDE 1

Patterns Of Streaming Applications (POSA)

Monal Daxini, 11/6/2018, @monaldax

SLIDE 2
Profile

  • 4+ years building stream processing platform at Netflix
  • Drove technical vision, roadmap, led implementation
  • 17+ years building distributed systems

@monaldax

SLIDE 3

Structure Of The Talk

  • Set The Stage
  • Stream Processing?
  • 8 Patterns: 5 Functional, 3 Non-Functional

@monaldax

SLIDE 4

Disclaimer

Inspired by true events encountered while building and operating a stream processing platform, and by use cases that are in production or in the ideation phase in the cloud. Some code and identifying details have been changed, and artistic liberties taken, to protect the privacy of streaming applications and to share the know-how. Some use cases may have been simplified.

@monaldax

SLIDE 5

Stream Processing?

Processing Data-In-Motion

@monaldax

SLIDE 6

SLIDE 7

Lower Latency Analytics

SLIDE 8

User Activity Stream - Batched

[Figure: raw activity events for users Flash, Jessica, and Luke, batched across the Feb 25 / Feb 26 day boundary]

@monaldax

SLIDE 9

Sessions - Batched User Activity Stream

[Figure: sessions computed per daily batch split each user's activity at the Feb 25 / Feb 26 boundary]

@monaldax

SLIDE 10

Correct Session - Batched User Activity Stream

[Figure: the correct sessions span the Feb 25 / Feb 26 batch boundary]

@monaldax

SLIDE 11

Stream Processing Is A Natural Fit For User Activity Stream Sessions

@monaldax

[Figure: session windows per user (Flash, Jessica, Luke) defined by inactivity gaps, independent of batch boundaries]
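The sessionization sketched in the figures can be illustrated outside Flink. Below is a minimal plain-Java sketch (all names are illustrative, not the talk's code): one user's event timestamps are grouped into sessions whenever the gap between consecutive events exceeds an inactivity threshold — the same rule a session window applies per key.

```java
import java.util.*;

// Sketch: group one user's event timestamps into sessions separated by an
// inactivity gap, the way session windows do. Names are illustrative.
public class Sessionizer {
    // Returns a list of sessions; each session is a list of timestamps (ms).
    public static List<List<Long>> sessionize(List<Long> timestamps, long gapMs) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);            // event-time order, not arrival order
        List<Long> current = new ArrayList<>();
        for (long t : sorted) {
            if (!current.isEmpty() && t - current.get(current.size() - 1) > gapMs) {
                sessions.add(current);       // gap exceeded: close the session
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }

    public static void main(String[] args) {
        // Events at 0s, 10s, then 10 minutes later, with a 30s gap: two sessions.
        List<List<Long>> s = sessionize(Arrays.asList(0L, 10_000L, 610_000L), 30_000L);
        System.out.println(s.size() + " sessions");  // 2 sessions
    }
}
```

Because the rule is driven by event timestamps, the result is the same regardless of which daily batch an event lands in — which is exactly why batch boundaries broke the sessions above.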

SLIDE 12

Why Stream Processing?

1. Low-latency insights and analytics
2. Process unbounded data sets
3. ETL as data arrives
4. Ad-hoc analytics and event-driven applications

@monaldax

SLIDE 13

Set The Stage

Architecture & Flink

SLIDE 14

Stream Processing App Architecture Blueprint

[Diagram: Source → Stream Processing Job → Sink]

@monaldax

SLIDE 15

Stream Processing App Architecture Blueprint

[Diagram: multiple Sources and a Side Input → Stream Processing Job → Sinks]

@monaldax

SLIDE 16

Why Flink?

SLIDE 17

Flink Programs Are Streaming Dataflows – Streams And Transformation Operators

Image adapted, source: Flink Docs

@monaldax

SLIDE 18

Streams And Transformation Operators - Windowing

Image source: Flink Docs

[Figure: 10-second windowed aggregation over a keyed stream]

@monaldax
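The windowing idea can be sketched without Flink. A minimal plain-Java sketch (illustrative names, not Flink's API): each event timestamp is assigned to a 10-second tumbling window by aligning it down to the window boundary, and events per window are counted.

```java
import java.util.*;

// Sketch: 10-second tumbling-window counts over event timestamps (ms).
public class TumblingWindow {
    // Maps each window start (ms) to the number of events falling in it.
    public static SortedMap<Long, Integer> countPerWindow(List<Long> eventTimesMs,
                                                          long windowMs) {
        SortedMap<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMs) {
            long windowStart = (t / windowMs) * windowMs;  // align to boundary
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        SortedMap<Long, Integer> c =
            countPerWindow(Arrays.asList(1_000L, 9_999L, 12_000L), 10_000L);
        System.out.println(c);  // {0=2, 10000=1}
    }
}
```

In Flink the same grouping would be expressed declaratively with `keyBy(...).window(...)` and computed incrementally as events arrive.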

SLIDE 19

Streaming Dataflow DAG

Image adapted, source: Flink Docs

@monaldax

SLIDE 20

Scalable Automatic Scheduling Of Operations

Image adapted, source: Flink Docs

[Diagram: Job Manager schedules operator tasks across TaskManager processes; source/map at parallelism 2, Sink at parallelism 1]

@monaldax

SLIDE 21

Flexible Deployment

Bare Metal, VM / Cloud, Containers

@monaldax

SLIDE 22

Stateless Stream Processing

No state is maintained across events

Image adapted from: Stephan Ewen

@monaldax

SLIDE 23

Fault-Tolerant Stateful Processing

[Diagram: Source / Producers → Flink TaskManager with fault-tolerant local state → Sink]

  • In-memory / on-disk local state access
  • Checkpoints: periodic, asynchronous, incremental
  • Savepoints: explicitly triggered

@monaldax

SLIDE 24

Levels Of API Abstraction In Flink

Source: Flink Documentation

SLIDE 25

Describing Patterns

@monaldax

SLIDE 26
Describing Design Patterns

  • Use Case / Motivation
  • Pattern
  • Code Snippet & Deployment mechanism
  • Related Pattern, if any

@monaldax

SLIDE 27

Patterns

Functional

SLIDE 28
  • 1. Configurable Router

@monaldax

SLIDE 29

SLIDE 30

SLIDE 31
1.1 Use Case / Motivation – Ingest Pipelines

  • Create ingest pipelines for different event streams declaratively
  • Route events to data warehouse and data stores for analytics
  • With at-least-once semantics
  • Streaming ETL – allow declarative filtering and projection

@monaldax

SLIDE 32

1.1 Keystone Pipeline – A Self-serve Product

  • SERVERLESS
  • Turnkey – ready to use
  • 100% in the cloud
  • No code, Managed Code & Operations

@monaldax

SLIDE 33

1.1 UI To Provision 1 Data Stream, A Filter, & 3 Sinks

SLIDE 34

1.1 Optional Filter & Projection (Out of the box)

SLIDE 35

1.1 Provision 1 Kafka Topic, 3 Configurable Router Jobs

[Diagram: Events producer → play_events Kafka topic (fan-out: 3) → 3 Configurable Router Jobs, each running Filter → Projection → Connector, delivering to consumers and sinks such as Elasticsearch]

@monaldax

SLIDE 36

@monaldax

1.1 Keystone Pipeline Scale

  • Up to 1 trillion new events / day
  • Peak: 12M events / sec, 36 GB / sec
  • ~4 PB of data transported / day
  • ~2000 Router Jobs / 10,000 containers
SLIDE 37

1.1 Pattern: Configurable Isolated Router

@monaldax

[Diagram: Events producer → Configurable Router Job (declarative processors) → Sink]

SLIDE 38

1.1 Code Snippet: Configurable Isolated Router (No User Code)

val kafkaSource = getSourceBuilder.fromKafka("topic1").build()
val selectedSink = getSinkBuilder()
  .toSelector(sinkName).declareWith("kafkasink", kafkaSink)
  .or("s3sink", s3Sink).or("essink", esSink).or("nullsink", nullSink).build()
kafkaSource
  .filter(KeystoneFilterFunction).map(KeystoneProjectionFunction)
  .addSink(selectedSink)

@monaldax
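The `declareWith(...).or(...)` sink selector above can be sketched in plain Java to show the idea: sinks are registered by name and the one named in configuration receives the events, so routing changes need no user code. This is a minimal sketch with illustrative names, not Keystone's implementation.

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of a configurable sink selector: events are routed to whichever
// registered sink the configuration names. Names are illustrative.
public class SinkSelector {
    private final Map<String, Consumer<String>> sinks = new HashMap<>();
    private final String selected;   // comes from declarative job config

    SinkSelector(String selectedName) { this.selected = selectedName; }

    SinkSelector declareWith(String name, Consumer<String> sink) {
        sinks.put(name, sink);
        return this;
    }
    SinkSelector or(String name, Consumer<String> sink) {  // mirrors the slide's DSL
        return declareWith(name, sink);
    }
    void emit(String event) {
        sinks.getOrDefault(selected, e -> {}).accept(event);  // unknown name: null sink
    }

    public static void main(String[] args) {
        List<String> esSink = new ArrayList<>();
        SinkSelector selector = new SinkSelector("essink")
            .declareWith("kafkasink", e -> {})
            .or("essink", esSink::add)
            .or("nullsink", e -> {});
        selector.emit("play_event");
        System.out.println(esSink);  // [play_event]
    }
}
```

Changing the routed destination is then a config change ("essink" → "s3sink"), which is what makes the router a self-serve, no-code product.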

SLIDE 39
1.2 Use Case / Motivation – Ingest Large Streams With High Fan-out Efficiently

  • A popular stream / topic has a high fan-out factor
  • Requires large Kafka clusters – expensive

[Diagram: Events producer → separate isolated router jobs (Filter, Projection) each writing TopicA and TopicB on Cluster1]

@monaldax

SLIDE 40

1.2 Pattern: Configurable Co-Isolated Router

Merge routing to the same Kafka cluster into one job

[Diagram: Events producer → Co-Isolated Router (Filter, Projection) → Kafka TopicA and TopicB on Cluster1]

@monaldax

SLIDE 41

1.2 Code Snippet: Configurable Co-Isolated Router (No User Code)

ui_A_Clicks_KafkaSource
  .filter(filter)
  .map(projection)
  .map(outputConverter)
  .addSink(kafkaSinkA_Topic1)

ui_A_Clicks_KafkaSource
  .map(transformer)
  .flatMap(outputFlatMap)
  .map(outputConverter)
  .addSink(kafkaSinkA_Topic2)

@monaldax

SLIDE 42
  • 2. Script UDF* Component

[Static / Dynamic]

*UDF – User Defined Function

@monaldax

SLIDE 43
  • 2. Use Case / Motivation – Configurable Business Logic Code for operations like transformations and filtering

[Diagram: Managed Router / Streaming Job – Source → Biz Logic (in the Job DAG) → Sink]

@monaldax

SLIDE 44
  • 2. Pattern: Static or Dynamic Script UDF (Stateless) Component

A script engine executes a function defined in the UI. Comes with all the pros and cons of a scripting engine.

[Diagram: Streaming Job – Source → UDF → Sink]

@monaldax

SLIDE 45

  • 2. Code Snippet: Script UDF Component (contents configurable at runtime)

val xscript = new DynamicConfig("x.script")
kafkaSource
  .map(new ScriptFunction(xscript))
  .filter(new ScriptFunction(xscript2))
  .addSink(new NoopSink())

// Script Function
val sm = new ScriptEngineManager()
val se = sm.getEngineByName("nashorn")
se.eval(script)

@monaldax
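The essence of the dynamic variant — business logic that can change at runtime without redeploying the job DAG — can be sketched without a script engine. In this plain-Java sketch (illustrative names; the deck uses a Nashorn `ScriptEngine` instead), the UDF is a function reference that a config push can swap while events keep flowing.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Sketch of a dynamically reconfigurable UDF: the active function is held in
// an AtomicReference and can be replaced at runtime, standing in for a
// script re-evaluated from dynamic config. Names are illustrative.
public class DynamicUdf {
    static final AtomicReference<Function<Integer, Integer>> udf =
        new AtomicReference<>(x -> x);           // initial "script": identity

    static int process(int event) { return udf.get().apply(event); }

    public static void main(String[] args) {
        int before = process(21);
        udf.set(x -> x * 2);                     // config push swaps the logic
        int after = process(21);
        System.out.println(before + " -> " + after);  // 21 -> 42
    }
}
```

A real script engine buys arbitrary user-supplied expressions at the cost of sandboxing and performance concerns — the "pros and cons of a scripting engine" the slide mentions.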

SLIDE 46
  • 3. The Enricher

@monaldax

SLIDE 47

Next 3 Patterns (3-5) Require Explicit Deployment

@monaldax

SLIDE 48
  • 3. Use Case – Generating Play Events For Personalization And Show Discovery

@monaldax

SLIDE 49

@monaldax

  • 3. Use Case: Create play events with current data from services, and a lookup table, for analytics. Using a lookup table keeps the originating events lightweight.

[Diagram: Play logs → Streaming Job; service call to Playback History Service behind a resource rate limiter; periodically updated Video Metadata lookup data]

SLIDE 50
  • 3. Pattern: The Enricher
  • Rate limit with a source or service rate limiter, or with resources
  • Pull or push data, sync / async
  • Service call
  • Lookup from data store
  • Static or periodically updated lookup data

[Diagram: Source → Streaming Job (with Side Input and source / service rate limiter) → Sink]

@monaldax

SLIDE 51
  • 3. Code Snippet: The Enricher

val kafkaSource = getSourceBuilder.fromKafka("topic1").build()
val parsedMessages = kafkaSource.flatMap(parser).name("parser")
val enrichedSessions = parsedMessages.filter(reflushFilter).name("filter")
  .map(playbackEnrichment).name("service")
  .map(dataLookup)
enrichedSessions.addSink(sink).name("sink")

@monaldax
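The two ingredients of the pattern — a lookup-table join and a guard on how fast the enrichment service is called — can be sketched in plain Java. All names are illustrative; the rate limiter here is a crude per-period token counter, standing in for the source/service rate limiter on the slide.

```java
import java.util.*;

// Sketch of The Enricher: join each event with lookup data (periodically
// refreshed in a real job), guarded by a simple token-style rate limiter.
public class Enricher {
    private final Map<String, String> lookup;  // e.g., eventKey -> metadata
    private int tokens;                        // service calls allowed this period

    Enricher(Map<String, String> lookup, int tokensPerPeriod) {
        this.lookup = lookup;
        this.tokens = tokensPerPeriod;
    }

    // Returns the enriched event, or null if rate-limited (caller may buffer/retry).
    String enrich(String event) {
        if (tokens <= 0) return null;          // rate limiter says no
        tokens--;
        return event + "|" + lookup.getOrDefault(event, "unknown");
    }

    public static void main(String[] args) {
        Enricher e = new Enricher(Map.of("play1", "title=show42"), 2);
        System.out.println(e.enrich("play1"));  // play1|title=show42
        e.enrich("play1");
        System.out.println(e.enrich("play1"));  // null (limit of 2 reached)
    }
}
```

Keeping the heavyweight attributes in the lookup table, as the slide notes, is what lets the originating play events stay small.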

SLIDE 52
  • 4. The Co-process Joiner

@monaldax

SLIDE 53
  • 4. Use Case – Play-Impressions Conversion Rate

@monaldax

SLIDE 54
  • 4. Impressions And Plays Scale
  • 130+ M members
  • 10+ B impressions / day
  • 2.5+ B play events / day
  • ~2 TB processing state

@monaldax

SLIDE 55

  • 4. Join Large Streams With Delayed, Out-Of-Order Events Based On Event Time
  • # impressions per user play
  • Impression attributes leading to the play

[Diagram: plays and impressions Kafka topics → Streaming Job → Sink; events P1, P3, I2, I1 arrive delayed and out of order]

@monaldax

SLIDE 56

Understanding Event Time

Image adapted from the Apache Beam presentation material

[Figure: input events placed by event time (10:00–15:00) vs output placed by processing time (10:00–15:00), aggregated into a 1-hour window]

SLIDE 57

  • 4. Use Case: Join Impressions And Plays Stream On Event Time

[Diagram: impressions and plays Kafka topics → keyBy F1 / keyBy F2 → co-process with keyed state holding I1, I2, P2; merge I2 and P2, then emit]

@monaldax

SLIDE 58

  • 4. Pattern: The Co-process Joiner
  • Process and coalesce events for each stream, grouped by the same key
  • Join if there is a match; evict when joined or timed out

[Diagram: Source 1 → keyBy F1, Source 2 → keyBy F2 → co-process with keyed state (State 1, State 2) → Sink]

@monaldax
SLIDE 59
  • 4. Code Snippet – The Co-process Joiner, Setup Sources

env.setStreamTimeCharacteristic(EventTime)

val impressionSource = kafkaSrc1
  .filter(eventTypeFilter)
  .flatMap(impressionParser)
  .keyBy(in => (s"${profile_id}_${title_id}"))

val playSessionSource = kafkaSrc2
  .flatMap(playbackParser)
  .keyBy(in => (s"${profile_id}_${title_id}"))

@monaldax

SLIDE 60

  • 4. Code Snippet – The Co-process Joiner, Setup Sources With Event-Time Watermarks

env.setStreamTimeCharacteristic(EventTime)

val impressionSource = kafkaSrc1
  .filter(eventTypeFilter)
  .flatMap(impressionParser)
  .assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor(Time.seconds(10)) {...})
  .keyBy(in => (s"${profile_id}_${title_id}"))

val playSessionSource = kafkaSrc2
  .flatMap(playbackParser)
  .assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor(Time.seconds(10)) {...})
  .keyBy(in => (s"${profile_id}_${title_id}"))
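The semantics of `BoundedOutOfOrdernessTimestampExtractor(Time.seconds(10))` can be sketched in a few lines of plain Java: the watermark trails the maximum event timestamp seen so far by a fixed out-of-orderness bound, and never moves backwards for late events. This is a model of the behavior, not Flink's class.

```java
// Sketch of bounded-out-of-orderness watermarking: the watermark is the max
// event timestamp seen so far minus a fixed lateness bound.
public class BoundedWatermark {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampMs = Long.MIN_VALUE;

    BoundedWatermark(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Observe an event; returns the watermark after seeing it.
    long onEvent(long eventTimeMs) {
        maxTimestampMs = Math.max(maxTimestampMs, eventTimeMs);
        return currentWatermark();
    }

    long currentWatermark() { return maxTimestampMs - maxOutOfOrdernessMs; }

    public static void main(String[] args) {
        BoundedWatermark wm = new BoundedWatermark(10_000);  // Time.seconds(10)
        wm.onEvent(100_000);
        wm.onEvent(95_000);  // a late event does not move the watermark backwards
        System.out.println(wm.currentWatermark());  // 90000
    }
}
```

When the watermark passes a timer's timestamp, the join state for that key can safely be evicted — which is what the `onTimer` callback in the next snippet relies on.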
SLIDE 61
  • 4. Code Snippet – The Co-process Joiner, Connect Streams

// Connect
impressionSource.connect(playSessionSource)
  .process(new CoprocessImpressionsPlays())
  .addSink(kafkaSink)

@monaldax

SLIDE 62
  • 4. Code Snippet – The Co-process Joiner, Co-process Function

class CoprocessImpressionsPlays extends CoProcessFunction {

  override def processElement1(value, context, collector) {
    … // update and reduce state, join with stream 2, set timer
  }

  override def processElement2(value, context, collector) {
    … // update and reduce state, join with stream 1, set timer
  }

  override def onTimer(timestamp, context, collector) {
    … // clear up state based on event time
  }
}

@monaldax
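The co-process logic elided above can be sketched in plain Java: each stream's events are buffered per key, a matching event from the other stream triggers the merged output, and the per-key state is evicted on join (or, in Flink, on an event-time timer). All names are illustrative.

```java
import java.util.*;

// Sketch of the co-process join: buffer each stream's events per key, emit a
// joined record on a match, evict state once joined or timed out.
public class CoProcessJoin {
    private final Map<String, String> impressions = new HashMap<>();  // state 1
    private final Map<String, String> plays = new HashMap<>();        // state 2
    final List<String> joined = new ArrayList<>();

    void onImpression(String key, String impression) {     // ~ processElement1
        String play = plays.remove(key);                   // matching play waiting?
        if (play != null) joined.add(key + ":" + impression + "+" + play);
        else impressions.put(key, impression);             // buffer until match/timeout
    }

    void onPlay(String key, String play) {                 // ~ processElement2
        String imp = impressions.remove(key);
        if (imp != null) joined.add(key + ":" + imp + "+" + play);
        else plays.put(key, play);
    }

    // In Flink this eviction happens in onTimer(), driven by event time.
    void onTimeout(String key) { impressions.remove(key); plays.remove(key); }

    public static void main(String[] args) {
        CoProcessJoin j = new CoProcessJoin();
        j.onImpression("profile1_title9", "I2");
        j.onPlay("profile1_title9", "P2");                 // match: merge I2 + P2, emit
        System.out.println(j.joined);                      // [profile1_title9:I2+P2]
    }
}
```

The timeout path is what bounds the ~2 TB of keyed state mentioned earlier: unmatched impressions are eventually evicted rather than retained forever.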

SLIDE 63
  • 5. Event-Sourced Materialized View

[Event Driven Application]

@monaldax

SLIDE 64
  • 5. Use Case: Publish movie assets' CDN locations to a cache (EVCache), to steer clients to the closest location for playback

[Diagram: Open Connect Appliances (CDN) upload asset_1 / delete asset_8 → Playback Assets Service emits events (asset_1 added, asset_8 deleted) → Streaming Job maintains a materialized view (asset1 → OCA1, OCA2; asset2 → OCA1, OCA3) in EVCache; a second source generates events to publish all assets]

@monaldax

SLIDE 65
  • 5. Use Case: Publish movie assets' CDN locations to a cache, to steer clients to the closest location for playback

[Diagram: Source → Streaming Job (materialized view, optional trigger to flush the view) → Event Publisher → Sink]

@monaldax

SLIDE 66
  • 5. Code Snippet – Setting Up Sources

val fullPublishSource = env
  .addSource(new FullPublishSourceFunction(),
    TypeInfoParser.parse("Tuple3<String, Integer, com.netflix.AMUpdate>"))
  .setParallelism(1)

val kafkaSource = getSourceBuilder().fromKafka("am_source")

@monaldax

SLIDE 67
  • 5. Code Snippet – Union Source & Processing

kafkaSource
  .flatMap(new FlatmapFunction())           // split by movie
  .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[...]())
  .union(fullPublishSource)                 // union with full publish source
  .keyBy(0, 1)                              // (cdn stack, movie)
  .process(new UpdateFunction())            // update in-memory state, output at intervals
  .keyBy(0, 1)                              // (cdn stack, movie)
  .process(new PublishToEvCacheFunction())  // publish to evcache

@monaldax
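The event-sourced view the `UpdateFunction` maintains can be sketched in plain Java: fold asset add/delete events into a map of asset → CDN locations, and publish snapshots of that map to the cache at intervals. Event names and the schema are illustrative, not the deck's code.

```java
import java.util.*;

// Sketch of an event-sourced materialized view: add/delete events are folded
// into a map of asset -> CDN locations; snapshot() is what gets published.
public class AssetView {
    private final Map<String, Set<String>> view = new HashMap<>();

    void onEvent(String op, String asset, String location) {
        if (op.equals("added")) {
            view.computeIfAbsent(asset, a -> new TreeSet<>()).add(location);
        } else if (op.equals("deleted")) {
            Set<String> locs = view.get(asset);
            if (locs != null) {
                locs.remove(location);
                if (locs.isEmpty()) view.remove(asset);  // drop empty entries
            }
        }
    }

    // Snapshot published to the cache (EVCache in the deck) at intervals.
    Map<String, Set<String>> snapshot() { return new TreeMap<>(view); }

    public static void main(String[] args) {
        AssetView v = new AssetView();
        v.onEvent("added", "asset1", "OCA1");
        v.onEvent("added", "asset1", "OCA2");
        v.onEvent("deleted", "asset1", "OCA2");
        System.out.println(v.snapshot());  // {asset1=[OCA1]}
    }
}
```

The full-publish source on the slide plays the role of replaying "added" events for every asset, so a fresh view can be rebuilt from scratch when needed.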

SLIDE 68

Patterns

Non-Functional

SLIDE 69
  • 6. Elastic Dev Interface

@monaldax

SLIDE 70

6. Elastic Dev Interface

A spectrum of ease, capability, and flexibility (from easiest to use to most capable):

  • Point & click, with UDFs
  • SQL with UDFs
  • Annotation-based API with code generation
  • Code with reusable components (e.g., data hygiene, script transformer)

SLIDE 71
  • 7. Stream Processing Platform

@monaldax

SLIDE 72
  • 7. Stream Processing Platform (SPaaS – Stream Processing as a Service)

[Diagram, bottom to top: Amazon EC2 → Container Runtime → Stream Processing Platform (streaming engine, config management) → Reusable Components (source & sink connectors, filtering, projection, etc.) → Routers and Streaming Jobs; supported by Management Service & UI, Metrics & Monitoring, Streaming Job Development, Dashboards]

@monaldax

SLIDE 73
  • 8. Rewind & Restatement

@monaldax

SLIDE 74
  • 8. Use Case – Restate Results Due To An Outage Or A Bug In Business Logic

[Timeline: Checkpoint x … Checkpoint y … outage … Checkpoint x+1 … now]

@monaldax

SLIDE 75

  • 8. Pattern: Rewind And Restatement

Rewind the source and state to a known good state, then reprocess

[Timeline: Checkpoint x … Checkpoint y … outage … Checkpoint x+1 … now]

@monaldax
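The rewind-and-restatement mechanics can be sketched in plain Java: checkpoints pair a source offset with the operator state at that offset; after a bad deploy or outage, the job rewinds to the last known-good checkpoint and replays the (replayable) source from that offset to restate results. A minimal sketch with illustrative names, using a running sum as stand-in state:

```java
import java.util.*;

// Sketch of rewind & restatement over a replayable source (e.g., a Kafka log):
// checkpoints capture (offset, state); rewinding restores both, and a replay
// from that offset restates the results.
public class Rewindable {
    private final List<Integer> source;          // replayable event log
    private int offset = 0;
    private long state = 0;                      // running sum as example state
    private final Deque<long[]> checkpoints = new ArrayDeque<>();  // {offset, state}

    Rewindable(List<Integer> source) { this.source = source; }

    void processNext() { state += source.get(offset++); }
    void checkpoint()  { checkpoints.push(new long[]{offset, state}); }

    void rewindToLastCheckpoint() {
        long[] cp = checkpoints.peek();          // last known good state
        offset = (int) cp[0];
        state = cp[1];
    }

    long replayToEnd() {                         // restatement pass
        while (offset < source.size()) processNext();
        return state;
    }

    public static void main(String[] args) {
        Rewindable job = new Rewindable(Arrays.asList(1, 2, 3, 4));
        job.processNext(); job.processNext();
        job.checkpoint();                        // good state: offset=2, sum=3
        job.processNext();                       // suppose this output was bad
        job.rewindToLastCheckpoint();            // discard the bad progress
        System.out.println(job.replayToEnd());   // 10
    }
}
```

In Flink, the checkpoint/savepoint plus a rewindable source such as Kafka gives exactly this capability, which is why the pattern needs no extra machinery beyond retaining the source long enough to replay.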

SLIDE 76

Summary

@monaldax

SLIDE 77

Patterns Summary

FUNCTIONAL
1. Configurable Router
2. Script UDF Component
3. The Enricher
4. The Co-process Joiner
5. Event-Sourced Materialized View

NON-FUNCTIONAL
6. Elastic Dev Interface
7. Stream Processing Platform
8. Rewind & Restatement

@monaldax

SLIDE 78

Q & A

Thank you

If you would like to discuss more

  • @monaldax
  • linkedin.com/in/monaldax
SLIDE 79
Additional Stream Processing Material

  • Flink at Netflix, Paypal speaker series, 2018 – http://bit.ly/monal-paypal
  • Unbounded Data Processing Systems, Strangeloop, 2016 – http://bit.ly/monal-sloop
  • AWS Re-Invent 2017 Netflix Keystone SPaaS, 2017 – http://bit.ly/monal-reInvent
  • Keynote – Stream Processing with Flink, 2017 – http://bit.ly/monal-ff2017
  • Dataflow Paper – http://bit.ly/dataflow-paper

@monaldax