STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA
Processing billions of events every day

Neha Narkhede
- Co-founder and Head of Engineering @ Stealth Startup
- Prior to this…
  - Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza)
  - One of the initial authors of Apache Kafka, committer and PMC member
- Reach out at @nehanarkhede

Agenda
- Real-time Data Integration
- Introduction to Logs & Apache Kafka
- Logs & Stream processing
- Apache Samza
- Stateful stream processing

The Data Needs Pyramid
[Figure: two pyramids side by side. Maslow's hierarchy of needs, top to bottom: Self actualization, Esteem, Love/Belonging, Safety, Physiological. Data needs, top to bottom: Automation, Understanding, Data processing, Data collection.]

Increase in diversity of data
- 1980+: Database data (users, products, orders etc.)
- 2000+: Events (clicks, impressions, pageviews); Application logs (errors, service calls); Application metrics (CPU usage, requests/sec)
- 2010+: IoT sensors
[Figure label: Siloed data feeds]

Explosion in diversity of systems
- Live Systems
  - Voldemort
  - Espresso
  - GraphDB
  - Search
  - Samza
- Batch
  - Hadoop
  - Teradata

Data integration disaster
[Diagram: every source (Espresso, Voldemort, Oracle, Operational Logs, User Tracking, Operational Metrics, Production Services) is wired point-to-point to every destination (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, Email, ...).]

Centralized service
[Diagram: the same sources (Espresso, Voldemort, Oracle, Operational Logs, User Tracking, Operational Metrics, Production Services) write to a single Data Pipeline, which feeds the destinations (Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec Engine & Life, Search, Security, Email, ...).]

Kafka at 10,000 ft
- Distributed from ground up
- Persistent
- Multi-subscriber
[Diagram: many producers write to a cluster of brokers; many consumers read from it.]

Key design principles
- Scalability of a file system
  - Hundreds of MB/sec/server throughput
  - Many TBs per server
- Guarantees of a database
  - Messages strictly ordered
  - All data persistent
- Distributed by default
  - Replication model
  - Partitioning model

Kafka adoption

Apache Kafka @ LinkedIn
- 175 TB of in-flight log data per colo
- Low-latency: ~1.5ms
- Replicated to each datacenter
- Tens of thousands of data producers
- Thousands of consumers
- 7 million messages written/sec
- 35 million messages read/sec
- Hadoop integration

Logs
The data structure every systems engineer should know

The Log
- Ordered
- Append only
- Immutable
[Diagram: records at offsets 0 1 2 3 4 5 6 7 8 9 10 11 12; the 1st record is at offset 0 and the next record is written after offset 12.]
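The three properties on this slide can be sketched in a few lines of Java. This is a hypothetical in-memory illustration, not Kafka's implementation; the class name `AppendOnlyLog` and its methods are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Minimal in-memory append-only log (illustrative only, not Kafka code). */
final class AppendOnlyLog<T> {
    private final List<T> records = new ArrayList<>();

    /** Appends a record at the end and returns its offset (0, 1, 2, ...). */
    long append(T record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Returns up to maxRecords records starting at fromOffset, in write order. */
    List<T> read(long fromOffset, int maxRecords) {
        if (fromOffset < 0 || fromOffset > records.size()) {
            throw new IllegalArgumentException("offset out of range: " + fromOffset);
        }
        int start = (int) fromOffset;
        int end = (int) Math.min((long) records.size(), fromOffset + maxRecords);
        // Hand out a read-only copy: existing records are never changed in place.
        return Collections.unmodifiableList(new ArrayList<>(records.subList(start, end)));
    }

    /** Offset that the next appended record will receive. */
    long endOffset() {
        return records.size();
    }
}
```

There is deliberately no update or delete method: the only write is `append`, which is what makes the offsets a stable, ordered address space for readers.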

The Log: Partitioning
Partition 0: 0 1 2 3 4 5 6 7 8 9 10 11 12
Partition 1: 0 1 2 3 4 5 6 7 8 9
Partition 2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
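As a rough sketch of why partitions grow to different lengths, the Java below routes each record to a partition by hashing its key, so every partition keeps its own independent offset sequence. The class `PartitionedLog` is hypothetical, and `String.hashCode` is used only for illustration; Kafka's own default partitioner uses a different hash function.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical partitioned log: one independent append-only list per partition. */
final class PartitionedLog {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedLog(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Same key always maps to the same partition (illustrative hash choice). */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends to the key's partition; returns {partition, offset within that partition}. */
    long[] append(String key, String value) {
        int p = partitionFor(key);
        List<String> log = partitions.get(p);
        log.add(value);
        return new long[] {p, log.size() - 1};
    }

    /** Returns a copy of one partition's records in write order. */
    List<String> readPartition(int partition) {
        return new ArrayList<>(partitions.get(partition));
    }
}
```

Ordering is only defined within a partition, which is why records that must stay ordered relative to each other should share a key.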

Logs: pub/sub done right
[Diagram: a data source writes to the end of a log with offsets 0 1 2 3 4 5 6 7 8 9 10 11 12. Destination system A reads at time = 7 and destination system B reads at time = 11.]
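The picture of two destination systems at different positions can be sketched as readers that each hold their own offset into one shared log. `LogReader` is a hypothetical name for this example, and the in-memory list stands in for the log.

```java
import java.util.ArrayList;
import java.util.List;

/** A reader that tracks its own position in a shared append-only log (illustrative). */
final class LogReader {
    private final List<String> log;  // shared log; readers never modify it
    private int position;            // offset of the next record this reader will see

    LogReader(List<String> log, int startOffset) {
        if (startOffset < 0 || startOffset > log.size()) {
            throw new IllegalArgumentException("start offset out of range: " + startOffset);
        }
        this.log = log;
        this.position = startOffset;
    }

    /** Returns up to max unread records and advances only this reader's position. */
    List<String> poll(int max) {
        int end = Math.min(log.size(), position + max);
        List<String> batch = new ArrayList<>(log.subList(position, end));
        position = end;
        return batch;
    }

    int position() {
        return position;
    }
}
```

Because reading does not remove anything, a slow reader and a fast reader never interfere with each other; each one simply has a different `position`.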

Logs for data integration
[Diagram: "User updates profile with new job" -> KAFKA -> Standardization engine, Hadoop, Search, Newsfeed]

Stream processing = f(log)
[Diagram: Log A -> Job 1 -> Log B]

Stream processing = f(log)
[Diagram: a dataflow graph of two jobs (Job 1, Job 2) over five logs (Log A, Log B, Log C, Log D, Log E)]

Apache Samza at LinkedIn
[Diagram: "User updates profile with new job" -> KAFKA -> Standardization engine, Hadoop, Search, Newsfeed]

Latency spectrum of data systems
- Synchronous, RPC (milliseconds)
- Asynchronous processing (seconds to minutes)
- Batch (hours)

Samza API

    public interface StreamTask {
      void process(IncomingMessageEnvelope envelope,  // getKey(), getMsg()
                   MessageCollector collector,        // sendMsg(topic, key, value)
                   TaskCoordinator coordinator);      // commit(), shutdown()
    }
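To make the shape of a task concrete, here is a minimal sketch of a filter job written against simplified stand-in types. `Envelope`, `Collector`, `Coordinator`, `SimpleStreamTask`, and `ErrorFilterTask` are hypothetical names that mirror the slide's annotations; they are not the real classes in `org.apache.samza`, whose signatures are richer, and the "errors" topic and commit interval are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the types named on the slide.
final class Envelope {
    final String key;
    final String message;

    Envelope(String key, String message) {
        this.key = key;
        this.message = message;
    }
}

interface Collector {
    void send(String topic, String key, String value);
}

interface Coordinator {
    void commit();
    void shutdown();
}

interface SimpleStreamTask {
    void process(Envelope envelope, Collector collector, Coordinator coordinator);
}

/** Forwards only messages containing "ERROR" to an assumed "errors" topic. */
final class ErrorFilterTask implements SimpleStreamTask {
    private int seen = 0;

    @Override
    public void process(Envelope envelope, Collector collector, Coordinator coordinator) {
        if (envelope.message.contains("ERROR")) {
            collector.send("errors", envelope.key, envelope.message);
        }
        // Ask for a checkpoint every 100 messages (an arbitrary interval for the sketch).
        if (++seen % 100 == 0) {
            coordinator.commit();
        }
    }
}

/** Test double that records what a task emitted as "topic:key:value" strings. */
final class RecordingCollector implements Collector {
    final List<String> sent = new ArrayList<>();

    @Override
    public void send(String topic, String key, String value) {
        sent.add(topic + ":" + key + ":" + value);
    }
}
```

The task sees one message at a time and has no knowledge of partitions or hosts, which is what lets the framework decide where and how many copies of it to run.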

Samza Architecture (Logical view)
[Diagram: Log A (partition 0, partition 1, partition 2) -> Task 1, Task 2, Task 3 -> Log B (partition 0, partition 1)]

Samza Architecture (Logical view)
[Diagram: Log A (partition 0, partition 1, partition 2) -> Task 1, Task 2, Task 3, grouped into Samza container 1 and Samza container 2 -> Log B (partition 0, partition 1)]
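One way to picture the grouping in this diagram is one task per input partition, with tasks spread over containers. The round-robin rule below is an assumption made for the sketch (the slide does not say how tasks are grouped), and `TaskAssignment` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical round-robin grouping of tasks (one per input partition) into containers. */
final class TaskAssignment {
    /** Returns, for each container index, the partition ids whose tasks it runs. */
    static List<List<Integer>> assign(int numPartitions, int numContainers) {
        if (numContainers <= 0) {
            throw new IllegalArgumentException("need at least one container");
        }
        List<List<Integer>> containers = new ArrayList<>();
        for (int c = 0; c < numContainers; c++) {
            containers.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            containers.get(p % numContainers).add(p);
        }
        return containers;
    }
}
```

With three partitions and two containers, this rule puts the tasks for partitions 0 and 2 in the first container and the task for partition 1 in the second.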

Samza Architecture (Physical view)
[Diagram: Samza container 1 runs on Host 1; Samza container 2 runs on Host 2.]

Samza Architecture (Physical view)
[Diagram: each host runs a YARN Node manager next to its Samza container; a Samza YARN AM is also shown.]

Samza Architecture (Physical view)
[Diagram: as before, with a Kafka broker on each host alongside the Node manager and Samza container.]

Samza Architecture: Equivalence to Map Reduce
[Diagram: each host runs a Node manager with Map Reduce tasks in place of Samza containers, a Map Reduce YARN AM in place of the Samza YARN AM, and HDFS in place of Kafka.]

M/R Operation Primitives ¨ Filter records matching some condition ¨ Map record = f(record) ¨ Join Two/more datasets by key ¨ Group records with same key ¨ Aggregate f(records within the same group) ¨ Pipe job 1’s output => job 2’s input

M/R Operation Primitives on streams
- Filter: records matching some condition
- Map: record = f(record)
- Join: two or more datasets by key
- Group: records with the same key
- Aggregate: f(records within the same group)
- Pipe: job 1's output => job 2's input
Join, Group, and Aggregate require state maintenance.
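To see why grouping and aggregating need state on a stream, consider a running count per key: the input never ends, so the job has to remember the counts seen so far between messages. The class `CountByKey` below is a hypothetical sketch that keeps that state in a plain in-memory map.

```java
import java.util.HashMap;
import java.util.Map;

/** Running count per key: a stateful group + aggregate over an unbounded stream. */
final class CountByKey {
    // The state that must survive from one message to the next.
    private final Map<String, Long> counts = new HashMap<>();

    /** Processes one record and returns the updated count for its key. */
    long onRecord(String key) {
        return counts.merge(key, 1L, Long::sum);
    }

    /** Current count for a key, or 0 if the key has not been seen. */
    long countFor(String key) {
        return counts.getOrDefault(key, 0L);
    }
}
```

Filter and Map, by contrast, can be written as pure functions of the current record and need no such map.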

Example: Newsfeed
Status update log:
  User 567 posted "Hello World"
  User 989 posted "Blah Blah"
  User ... posted "..."
Fan out messages to followers (looked up in an external connection DB):
  567 -> [123, 679, 789, ...]
  999 -> [156, 343, ... ]
Push notification log:
  Refresh user 123's newsfeed
  Refresh user 679's newsfeed
  Refresh user ...'s newsfeed

Local state vs Remote state: Remote
[Diagram: two Samza tasks (partition 0, partition 1), each handling 100-500K msg/sec/node, query a remote state store on disk (ex: Cassandra, MongoDB, etc.) that serves 1-5K queries/sec ??]
Problems with remote state:
- ❌ Performance
- ❌ Isolation
- ❌ Limited APIs

Local state: Bring data closer to computation
[Diagram: two Samza tasks (partition 0, partition 1), each with its own local LevelDB/RocksDB store.]

Local state: Bring data closer to computation
[Diagram: as before, with each local LevelDB/RocksDB store also writing to a change log stream on disk.]

Example Revisited: Newsfeed
Status update log:
  User 567 posted "Hello World"
  User 989 posted "Blah Blah"
  User ... posted "..."
New connection log:
  User 890 followed 234
  User 123 followed 567
  User ... followed ...
Fan out messages to followers:
  567 -> [123, 679, 789, ...]
  999 -> [156, 343, ... ]
Push notification log:
  Refresh user 123's newsfeed
  Refresh user 679's newsfeed
  Refresh user ...'s newsfeed
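The revisited example can be sketched as a task that consumes two logs: new connections update a local follower table, and status updates are joined against that table to produce refresh notifications. `NewsfeedFanOut` and its method names are hypothetical, and the in-memory map stands in for the local store.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical fan-out task: joins status updates against locally held follower lists. */
final class NewsfeedFanOut {
    // Local state built from the new connection log: followee -> followers.
    private final Map<Integer, List<Integer>> followers = new HashMap<>();

    /** Handles a "User <follower> followed <followee>" record. */
    void onNewConnection(int follower, int followee) {
        followers.computeIfAbsent(followee, k -> new ArrayList<>()).add(follower);
    }

    /** Handles a "User <author> posted ..." record; returns the notifications to emit. */
    List<String> onStatusUpdate(int author, String text) {
        List<String> notifications = new ArrayList<>();
        for (int follower : followers.getOrDefault(author, Collections.<Integer>emptyList())) {
            notifications.add("Refresh user " + follower + "'s newsfeed");
        }
        return notifications;
    }
}
```

No call leaves the process during `onStatusUpdate`; the lookup that previously went to an external connection DB is now a local map read.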

Fault tolerance?
[Diagram: Host 1 and Host 2 each run a Node manager, a Samza container, and Kafka; a Samza YARN AM is also shown.]

Fault tolerance in Samza
[Diagram: two Samza tasks (partition 0, partition 1), each with a local LevelDB/RocksDB store backed by a durable change log.]
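The recovery idea behind a durable change log can be sketched as follows: every write to the local store is also appended to a log, and a restarted task rebuilds its store by replaying that log from the beginning. `ChangeloggedStore` is a hypothetical class; the shared list stands in for a durable, replicated change log stream, and a `null` value is used here as a delete marker.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical key-value store whose writes are mirrored to a change log. */
final class ChangeloggedStore {
    private final Map<String, String> local = new HashMap<>();
    private final List<String[]> changelog;  // stand-in for a durable log of {key, value}

    /** Restores local state by replaying every change log entry in order. */
    ChangeloggedStore(List<String[]> changelog) {
        this.changelog = changelog;
        for (String[] entry : changelog) {
            apply(entry[0], entry[1]);
        }
    }

    void put(String key, String value) {
        changelog.add(new String[] {key, value});
        apply(key, value);
    }

    void delete(String key) {
        changelog.add(new String[] {key, null});  // null value marks a deletion
        apply(key, null);
    }

    String get(String key) {
        return local.get(key);
    }

    private void apply(String key, String value) {
        if (value == null) {
            local.remove(key);
        } else {
            local.put(key, value);
        }
    }
}
```

Constructing a second store over the same change log yields the same contents as the first, which models a task restarting on a different host after a failure.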

Slow jobs
[Diagram: a dataflow graph of Job 1 and Job 2 over Log A, Log B, Log C, Log D, Log E.]
Options when a downstream job is slow:
- ❌ Drop data
- ❌ Backpressure
- Queue
  - ❌ In memory
  - ✅ On disk (KAFKA)

Summary
- Real-time data integration is crucial for the success and adoption of stream processing
- Logs form the basis for real-time data integration
- Stream processing = f(logs)
- Samza is designed from the ground up for scalability and provides fault-tolerant, persistent state

Thank you!
- The Log
  - http://bit.ly/the_log
- Apache Kafka
  - http://kafka.apache.org
- Apache Samza
  - http://samza.incubator.apache.org
- Me
  - @nehanarkhede
  - http://www.linkedin.com/in/nehanarkhede
