Streaming Why should I care? Christian Trebing Blue Yonder GmbH

Streaming Why should I care? Christian Trebing Blue Yonder GmbH @ctrebing 1

Agenda 2
• Motivation
• Streaming Intro
• Implementation
• Challenges

Data Processing - The Monolith 3
Data Input, Data Validation, Machine Learning, and Data Output all work on THE Database

Pain 4
Several teams are developing this application
• Customer desperately wants a new feature in machine learning
• But the data validation team is in the midst of refactoring their database structure ("will be finished in two weeks")
• So you wait…

Could Microservices Help? 5
Each service (Data Input, Data Validation, Machine Learning, Data Output) gets its own data.
Great:
• No dependency on single state
• Independent development
• Independent upgrades
Difficult:
• Too much data to transfer
• Too much data to store in each service

Are There Other Possibilities? 6

Streaming Intro 7

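Databases and Streams 8
Same information in table and stream:
Stream: A=1 B=5 C=3 A=8 A=4 C=2
Table after each stream entry:
after A=1: A: 1
after B=5: A: 1, B: 5
after C=3: A: 1, B: 5, C: 3
after A=8: A: 8, B: 5, C: 3
after A=4: A: 4, B: 5, C: 3
after C=2: A: 4, B: 5, C: 2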
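A minimal sketch of the table/stream duality on this slide: replaying the stream from the start and keeping the latest value per key yields the table. The function name `replay` is made up for illustration.

```python
# Events exactly as listed on the slide, in stream order.
stream = [("A", 1), ("B", 5), ("C", 3), ("A", 8), ("A", 4), ("C", 2)]


def replay(events):
    """Yield the table (latest value per key) after each event."""
    table = {}
    for key, value in events:
        table[key] = value   # a later event for the same key overwrites it
        yield dict(table)    # copy, so earlier snapshots stay unchanged


snapshots = list(replay(stream))
for event, snapshot in zip(stream, snapshots):
    print(event, "->", snapshot)
```

The last snapshot is the current table; every earlier snapshot is a state the table went through, which a plain table no longer remembers.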

Why does it matter? 9
Different services can be in different states
• Each service can consume the stream at its own speed
• One service can be updated while the other runs
Example: on the stream A=1 B=5 C=3 A=8 A=4 C=2, Service 1 is on index 3 and Service 2 is on index 5
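A small sketch of two services reading the same stream at their own pace. The `Service` class is hypothetical, and "index" is interpreted here as the number of events a service has already consumed, which is an assumption about the slide.

```python
log = ["A=1", "B=5", "C=3", "A=8", "A=4", "C=2"]


class Service:
    """Keeps its own offset into the shared log and its own table."""

    def __init__(self, name):
        self.name = name
        self.offset = 0      # position of the next event to read
        self.table = {}

    def poll(self, log, max_events):
        """Consume up to max_events events, starting at the own offset."""
        batch = log[self.offset:self.offset + max_events]
        for event in batch:
            key, value = event.split("=")
            self.table[key] = int(value)
        self.offset += len(batch)
        return len(batch)


service_1 = Service("Service 1")
service_2 = Service("Service 2")
service_1.poll(log, 3)       # Service 1 has seen three events
service_2.poll(log, 5)       # Service 2 is further ahead
view_1 = dict(service_1.table)
view_2 = dict(service_2.table)

service_1.poll(log, 10)      # later, Service 1 catches up on its own
```

Neither service blocks the other: the log is shared, but the offsets and the derived tables are private to each service.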

Partitioned Streams 10
Sales stream, partitioned by location
Each partition could be handled by a different processor
location   product    sales_date  quantity
Rimini     Spaghetti  2017-07-10  5
Rimini     Ravioli    2017-07-10  8
Rimini     Pizza      2017-07-11  1
Rimini     Spaghetti  2017-07-11  7
Bilbao     Pizza      2017-07-10  3
Bilbao     Ravioli    2017-07-11  9
Bilbao     Spaghetti  2017-07-11  5
Karlsruhe  Pizza      2017-07-10  8
Karlsruhe  Ravioli    2017-07-10  3
Karlsruhe  Ravioli    2017-07-11  7
Karlsruhe  Spaghetti  2017-07-11  7
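A rough sketch of partitioning by key, using a few records from this slide. The partition count and the CRC32-based `partition_for` function are assumptions for illustration; real Kafka clients ship their own partitioners.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


sales = [
    {"location": "Rimini", "product": "Spaghetti", "sales_date": "2017-07-10", "quantity": 5},
    {"location": "Bilbao", "product": "Pizza", "sales_date": "2017-07-10", "quantity": 3},
    {"location": "Karlsruhe", "product": "Pizza", "sales_date": "2017-07-10", "quantity": 8},
    {"location": "Rimini", "product": "Ravioli", "sales_date": "2017-07-10", "quantity": 8},
    {"location": "Bilbao", "product": "Ravioli", "sales_date": "2017-07-11", "quantity": 9},
]

partitions = defaultdict(list)
for record in sales:
    # All records with the same location end up in the same partition,
    # in the order in which they were produced.
    partitions[partition_for(record["location"])].append(record)
```

Ordering is only kept within a partition, so the partition key should be the field whose records need to be processed in order.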

Data Processing - With Streaming 11
Data Input, Data Validation, Machine Learning, and Data Output each keep their own data and are connected through the Streaming Platform

What did we gain? 12
• Independent Development
• Independent Upgrade
• Scalability
Did we throw out databases completely?
• Let’s see…

Is it magic? 13
No, it’s a tradeoff:
• A database is so powerful: ACID guarantees, SQL language. You can do nearly everything
• This comes at a price
  • Dependency on single state
  • Scaling is hard
So let’s think more about what we really need

What do we lose? 14
Database     Stream
ACID         Ordering on stream partition
SQL Queries  Service is responsible for keeping its state
You have to decide whether you can live with that.

Implementation 15

Apache Kafka 16

Kafka Clients in Python 17
• pykafka, python-kafka, confluent-kafka-client
• Nice comparison has been done here: http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
• Most performant currently is confluent-kafka-client
  • Uses the C library librdkafka

Producer 18
from confluent_kafka import Producer

p = Producer({'bootstrap.servers': 'mybroker,mybroker2'})
for data in some_data_source:
    p.produce('mytopic', data.encode('utf-8'))
p.flush()

Consumer 19
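The code on this slide did not survive in the transcript. As a stand-in, here is a hedged sketch of a poll loop with the `confluent_kafka` `Consumer`. The helper names `make_consumer` and `consume`, the `auto.offset.reset` setting, and the stop conditions are assumptions, not the speaker's code.

```python
def make_consumer(servers, group_id):
    """Create a confluent_kafka Consumer.

    The import is done lazily so that the loop below can also be tried
    with a stand-in object, without a broker or the library installed.
    """
    from confluent_kafka import Consumer
    return Consumer({
        "bootstrap.servers": servers,
        "group.id": group_id,
        "auto.offset.reset": "earliest",   # start at the beginning if no offset is stored
    })


def consume(consumer, topic, handle, max_messages, timeout=1.0):
    """Pass decoded message values to handle().

    Stops after max_messages messages, or as soon as a poll returns nothing.
    """
    consumer.subscribe([topic])
    handled = 0
    try:
        while handled < max_messages:
            msg = consumer.poll(timeout)
            if msg is None:                # nothing arrived within the timeout
                break
            if msg.error():                # error reported for this message
                raise RuntimeError(str(msg.error()))
            handle(msg.value().decode("utf-8"))
            handled += 1
    finally:
        consumer.close()                   # leave the consumer group cleanly
    return handled
```

With a reachable broker this could be called as consume(make_consumer('mybroker,mybroker2', 'mygroup'), 'mytopic', print, max_messages=10), where 'mygroup' is a made-up group id. A long-running service would loop without a message limit.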

Apache Avro 20
Data Serialization, Enabling Schema Evolution
Clearly defined schema with:
• Schema evolution
• Schema registry
Field Name   Writer’s schema   Reader’s schema
location     string            string
product      string            string
sales_date   string            string
quantity     int               int
delivery_id  (not present)     int, default=0

Avro Schema 21
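The schema shown on this slide is not part of the transcript. The sketch below writes down schemas matching the field table of the previous slide, assuming `delivery_id` is the field added on the reader's side. The record name `Sale` is made up, and `resolve` is a simplified stand-in for Avro's schema resolution rule on plain dicts, not a call into an Avro library.

```python
import json

WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "Sale", "fields": [
  {"name": "location",   "type": "string"},
  {"name": "product",    "type": "string"},
  {"name": "sales_date", "type": "string"},
  {"name": "quantity",   "type": "int"}
]}
""")

# The reader's schema adds one field with a default value.
READER_SCHEMA = json.loads(json.dumps(WRITER_SCHEMA))   # deep copy
READER_SCHEMA["fields"].append({"name": "delivery_id", "type": "int", "default": 0})


def resolve(record, reader_schema):
    """Project a record written with an older schema onto the reader's fields."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]   # field unknown to the writer
        else:
            raise ValueError("no value and no default for field %r" % name)
    return resolved


old_record = {"location": "Rimini", "product": "Pizza",
              "sales_date": "2017-07-11", "quantity": 1}
new_view = resolve(old_record, READER_SCHEMA)
```

The default is what keeps old data readable: without it, a reader that expects `delivery_id` could not process records written before the field existed.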

AvroProducer 22

AvroConsumer 23

Example: Data Validation Separate valid and invalid sales records 24
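A minimal sketch of the routing decision behind such a validation processor. The validation rules, the JSON encoding, and the topic name `sales_invalid` are assumptions for illustration; `sales_validated` is the topic name used later in the talk.

```python
import json

REQUIRED_FIELDS = ("location", "product", "sales_date", "quantity")


def validate(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = ["missing field: " + name
                for name in REQUIRED_FIELDS if name not in record]
    if "quantity" in record:
        quantity = record["quantity"]
        if type(quantity) is not int or quantity < 0:
            problems.append("quantity must be a non-negative integer")
    return problems


def route(raw_message):
    """Map one raw JSON message to (target_topic, payload)."""
    try:
        record = json.loads(raw_message)
    except ValueError:
        return "sales_invalid", {"raw": raw_message, "problems": ["not valid JSON"]}
    if not isinstance(record, dict):
        return "sales_invalid", {"raw": raw_message, "problems": ["not a record"]}
    problems = validate(record)
    if problems:
        return "sales_invalid", {"record": record, "problems": problems}
    return "sales_validated", record
```

In a running processor, each consumed message would go through `route` and the payload would be produced to the returned topic.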

Additional Processors 25
Need to evolve your application:
• Add processors for evaluation topics
• Try new variant of validation logic
Database:
• remember processing state, same for each processor
Streaming:
• offset in each processor, can work independently

Challenge - Machine Learning Still Batch 26

How to get the input data? 27
Remember: There is no possibility to query a stream. Somewhere all this data needs to be.
Options:
• Memory of the service
• Serving database
• Blob store
Yes, that’s duplication. We’ll have to live with it.

Write Path / Read Path 28
With THE database: data validation writes to the database (write path); the machine learning query reads from it (read path).
With a blob store: data validation writes to the blob store (write path); the machine learning query reads from it (read path).

Machine Learning Input Data 29
Topics: locations_validated (table), products_validated (table), sales_validated
The sales records are joined with both tables and the result is appended to a file
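A rough sketch of this join, assuming that `locations_validated` and `products_validated` are kept as in-memory tables and `sales_validated` is the stream joined against them. The attribute names `country` and `category`, the helper names, and the JSON-lines output file are made up for illustration.

```python
import json
import os
import tempfile

# Tables built from the two table topics (latest record per key).
locations = {}   # from locations_validated
products = {}    # from products_validated


def upsert(table, key_field, record):
    """Apply one record from a table topic: the newest record per key wins."""
    table[record[key_field]] = record


def join_sale(sale):
    """Enrich one sales_validated record; return None if a table row is missing."""
    location = locations.get(sale["location"])
    product = products.get(sale["product"])
    if location is None or product is None:
        return None
    row = dict(sale)
    row["country"] = location["country"]     # hypothetical attribute
    row["category"] = product["category"]    # hypothetical attribute
    return row


def append_row(path, row):
    """Append one joined row as a JSON line to the machine learning input file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(row, sort_keys=True) + "\n")


upsert(locations, "location", {"location": "Rimini", "country": "IT"})
upsert(products, "product", {"product": "Pizza", "category": "food"})

output_path = os.path.join(tempfile.mkdtemp(), "ml_input.jsonl")
sales = [
    {"location": "Rimini", "product": "Pizza", "sales_date": "2017-07-11", "quantity": 1},
    {"location": "Bilbao", "product": "Pizza", "sales_date": "2017-07-10", "quantity": 3},
]
for sale in sales:
    row = join_sale(sale)
    if row is not None:          # the Bilbao sale is skipped: no location row yet
        append_row(output_path, row)
```

What to do with sales whose table rows have not arrived yet (skip, buffer, or retry) is a design decision this sketch leaves open.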

Challenge - State in Processors 30

State - Nightmare of every distributed systems engineer 31
Streaming: Data just rushes through
Why do we need state?
• Time window processing
• Data you want to join with
Formerly, the database did it for you
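A small sketch of why time window processing needs state: to report a running total per location and day, the processor has to remember what it has already seen. The one-day window and the class name `DailyTotals` are assumptions; the records are taken from the partitioned sales example.

```python
from collections import defaultdict


class DailyTotals:
    """Sums quantity per (location, sales_date) window."""

    def __init__(self):
        self.totals = defaultdict(int)   # this dict is the processor's state

    def process(self, sale):
        """Add one sale to its window and return the window's running total."""
        window = (sale["location"], sale["sales_date"])
        self.totals[window] += sale["quantity"]
        return self.totals[window]


processor = DailyTotals()
rimini_sales = [
    {"location": "Rimini", "sales_date": "2017-07-10", "quantity": 5},
    {"location": "Rimini", "sales_date": "2017-07-10", "quantity": 8},
    {"location": "Rimini", "sales_date": "2017-07-11", "quantity": 1},
    {"location": "Rimini", "sales_date": "2017-07-11", "quantity": 7},
]
running = [processor.process(sale) for sale in rimini_sales]
```

If this processor is restarted, `totals` is gone, which is exactly the problem the following slides deal with.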

State - Some Challenges 32
• Failure of a processor
• Scaling

State in Stream Processors - Possible Solutions 33
• Just keep in memory. Reprocess stream to warm up
• Each processor keeps its own db
• Save condensed in stream
• Get it from other service
Frameworks exist in other languages:
• for example: Kafka Streams, Apache Samza
Up to yesterday: none in Python. Then I heard about https://github.com/wintoncode/winton-kafka-streams
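A hedged sketch of the first and third options: a processor that keeps its state in memory can be warmed up either by reprocessing the stream from the start or by restoring a condensed snapshot and reading only the remaining events. The class `CountingProcessor` and the snapshot layout are made up for illustration; where the snapshot is stored is left open.

```python
class CountingProcessor:
    """Counts events per key; the state lives only in memory."""

    def __init__(self):
        self.state = {}
        self.offset = 0          # position of the next event to read

    def process(self, log):
        """Consume everything in the log from the own offset onwards."""
        for key in log[self.offset:]:
            self.state[key] = self.state.get(key, 0) + 1
        self.offset = len(log)

    def snapshot(self):
        """Condensed state plus the offset it corresponds to."""
        return {"offset": self.offset, "state": dict(self.state)}

    @classmethod
    def restore(cls, snapshot):
        processor = cls()
        processor.state = dict(snapshot["state"])
        processor.offset = snapshot["offset"]
        return processor


log = ["A", "B", "C", "A", "A", "C"]

original = CountingProcessor()
original.process(log[:3])
saved = original.snapshot()      # could itself be written to a stream
original.process(log)            # continues with the remaining events

# Option 1: after a failure, start empty and reprocess the whole stream.
replayed = CountingProcessor()
replayed.process(log)

# Option 3: restore the snapshot and read only what came after it.
restored = CountingProcessor.restore(saved)
restored.process(log)
```

Full replay needs no extra storage but gets slower as the stream grows; the snapshot shortens the warm-up at the price of having to store and version the condensed state.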

Summary 34
• You have more options for your data processing applications than you might have thought
• As always, there are some tradeoffs
• You know the challenges

Questions? 35

Blue Yonder 36
Best decisions, delivered daily
Blue Yonder GmbH, Ohiostraße 8, 76149 Karlsruhe, Germany, +49 721 383117 0
Blue Yonder Software Limited, 19 Eastbourne Terrace, London, W2 6LG, United Kingdom, +44 20 3626 0360
Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA
