STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA




SLIDE 1

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA

Processing billions of events every day

SLIDE 2

Neha Narkhede

• Co-founder and Head of Engineering @ Stealth Startup
• Prior to this…
  ◦ Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza)
  ◦ One of the initial authors of Apache Kafka, committer and PMC member
• Reach out at @nehanarkhede

SLIDE 3

Agenda

• Real-time Data Integration
• Introduction to Logs & Apache Kafka
• Logs & Stream processing
• Apache Samza
• Stateful stream processing

SLIDE 4

The Data Needs Pyramid

[Diagram: two pyramids, levels listed from base to top]
Maslow's hierarchy of needs: Physiological, Safety, Love/Belonging, Esteem, Self-actualization
Data needs: Data collection, Data processing, Understanding, Automation

SLIDE 5

Agenda

• Real-time Data Integration (current section)
• Introduction to Logs & Apache Kafka
• Logs & Stream processing
• Apache Samza
• Stateful stream processing

SLIDE 6

Increase in diversity of data

Timeline: 1980+, 2000+, 2010+

Siloed data feeds:
• Database data (users, products, orders, etc.)
• IoT sensors
• Events (clicks, impressions, pageviews)
• Application logs (errors, service calls)
• Application metrics (CPU usage, requests/sec)

SLIDE 7

Explosion in diversity of systems

• Live Systems
  ◦ Voldemort
  ◦ Espresso
  ◦ GraphDB
  ◦ Search
  ◦ Samza
• Batch
  ◦ Hadoop
  ◦ Teradata

SLIDE 8

Data integration disaster

[Diagram of direct connections between systems]
Oracle (×3), User Tracking, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Email, Voldemort (×3), Espresso (×3), Logs, Operational Metrics, Production Services ..., Security

SLIDE 9

Centralized service

[Diagram: the same systems, now connected through a central Data Pipeline]
Oracle (×3), User Tracking, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec Engine & Life, Search, Email, Voldemort (×3), Espresso (×3), Logs, Operational Metrics, Production Services ..., Security

SLIDE 10

Agenda

• Real-time Data Integration
• Introduction to Logs & Apache Kafka (current section)
• Logs & Stream processing
• Apache Samza
• Stateful stream processing

SLIDE 11

Kafka at 10,000 ft

• Distributed from the ground up
• Persistent
• Multi-subscriber

[Diagram: producers -> cluster of brokers -> consumers]

SLIDE 12

Key design principles

• Scalability of a file system
  ◦ Hundreds of MB/sec/server throughput
  ◦ Many TBs per server
• Guarantees of a database
  ◦ Messages strictly ordered
  ◦ All data persistent
• Distributed by default
  ◦ Replication model
  ◦ Partitioning model

SLIDE 13

Kafka adoption

SLIDE 14

Apache Kafka @ LinkedIn

• 175 TB of in-flight log data per colo
• Low-latency: ~1.5ms
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration

SLIDE 15

The data structure every systems engineer should know

Logs

SLIDE 16

The Log

• Ordered
• Append only
• Immutable

[Diagram: a sequence of records numbered 1 to 12, labeled "1st record" at the start and "next record written" at the end]
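A minimal sketch of these three properties, assuming an in-memory list stands in for the log. The class name `AppendOnlyLog` and its methods are hypothetical, and offsets here start at 0 rather than the 1-based numbering in the diagram.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory log: records are only ever appended, never updated.
final class AppendOnlyLog<T> {
    private final List<T> records = new ArrayList<>();

    // Appends a record at the end and returns its offset (position in the log).
    long append(T record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reads the record at an offset; an offset always refers to the same record.
    T read(long offset) {
        return records.get((int) offset);
    }

    // Offset that the next appended record will receive.
    long endOffset() {
        return records.size();
    }
}
```

There is deliberately no update or delete method: ordering comes from the offsets, and immutability from the absence of any way to change an existing record.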

SLIDE 17

The Log: Partitioning

[Diagram: three partitions of different lengths]
Partition 0: 1 2 3 4 5 6 7 8 9 10 11 12
Partition 1: 1 2 3 4 5 6 7 8 9
Partition 2: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
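One common way to pick a partition is to hash the record key, so that all records with the same key land in the same partition and keep their relative order. A sketch, assuming string keys and Java's built-in `hashCode`; `KeyPartitioner` is a hypothetical name, and this is not Kafka's actual partitioner.

```java
// Hypothetical helper: maps a record key to one of numPartitions partitions.
final class KeyPartitioner {
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even when the
        // hash code is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Ordering is then only guaranteed within a partition, which is the trade-off that lets the log scale across servers.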

SLIDE 18

Logs: pub/sub done right

[Diagram: a data source writes to the end of a log of records 1 to 12; Destination system A reads at position 7 (time = 7) and Destination system B reads at position 11 (time = 11)]
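The key point of the diagram is that each destination keeps its own position in the shared log. A sketch of that idea, assuming the log is a plain in-memory list; `LogReader` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader that tracks its own position in a shared log.
final class LogReader {
    private final List<String> log;
    private int position; // number of records this reader has already consumed

    LogReader(List<String> log, int startPosition) {
        this.log = log;
        this.position = startPosition;
    }

    // Returns every record not yet seen by this reader and advances its position.
    List<String> poll() {
        List<String> batch = new ArrayList<>(log.subList(position, log.size()));
        position = log.size();
        return batch;
    }

    int position() {
        return position;
    }
}
```

With a 12-record log, a reader started at position 7 gets five records from its first `poll()`, while one started at position 11 gets one. Neither reader affects the other or the writer.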

SLIDE 19

Logs for data integration

[Diagram: the event "User updates profile with new job" is published to KAFKA and consumed by Newsfeed, Search, Hadoop and the Standardization engine]

SLIDE 20

Agenda

• Real-time Data Integration
• Introduction to Logs & Apache Kafka
• Logs & Stream processing (current section)
• Apache Samza
• Stateful stream processing

SLIDE 21

Stream processing = f(log)

Log A -> Job 1 -> Log B

SLIDE 22

Stream processing = f(log)

[Diagram labels: Log A, Job 1, Job 2, Log B, Log C, Log D, Log E]
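Read literally, "stream processing = f(log)" says that a job is a function from an input log to an output log, so jobs compose by feeding one job's output log to the next. A sketch under that reading, with in-memory lists standing in for logs; `LogJob.run` is a hypothetical helper, not a Samza or Kafka API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper that treats a job as a function applied to a log.
final class LogJob {
    // Applies f to every record of the input log, in order, and returns
    // the results as a new output log. The input log is not modified.
    static <A, B> List<B> run(List<A> inputLog, Function<A, B> f) {
        List<B> outputLog = new ArrayList<>();
        for (A record : inputLog) {
            outputLog.add(f.apply(record));
        }
        return outputLog;
    }
}
```

Chaining two calls, where the second takes the first call's result as its input, mirrors two jobs connected by an intermediate log.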

SLIDE 23

Apache Samza at LinkedIn

[Diagram: the event "User updates profile with new job" is published to KAFKA and consumed by Newsfeed, Search, Hadoop and the Standardization engine]

SLIDE 24

Latency spectrum of data systems

Latency:
• Synchronous (milliseconds): RPC
• Asynchronous processing (seconds to minutes)
• Batch (hours)

SLIDE 25

Agenda

• Real-time Data Integration
• Introduction to Logs & Apache Kafka
• Logs & Stream processing
• Apache Samza (current section)
• Stateful stream processing

SLIDE 26

Samza API

public interface StreamTask {
  void process(IncomingMessageEnvelope envelope,   // getKey(), getMsg()
               MessageCollector collector,         // sendMsg(topic, key, value)
               TaskCoordinator coordinator);       // commit(), shutdown()
}
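To show the shape of a task without depending on the Samza library, the sketch below defines simplified stand-in interfaces that mirror only the abbreviated method names on the slide. `Envelope`, `Collector`, `Coordinator`, `SimpleStreamTask`, `UppercaseTask` and the topic name `uppercased` are all hypothetical. The real Samza interfaces have different method names and signatures, so check the Samza Javadoc before writing an actual job.

```java
import java.util.Locale;

// Simplified stand-ins for the three arguments on the slide.
// These are NOT the real Samza classes.
interface Envelope {
    String getKey();
    String getMsg();
}

interface Collector {
    void sendMsg(String topic, String key, String value);
}

interface Coordinator {
    void commit();
    void shutdown();
}

interface SimpleStreamTask {
    void process(Envelope envelope, Collector collector, Coordinator coordinator);
}

// Example task: forwards each non-empty message, upper-cased, to an output topic.
final class UppercaseTask implements SimpleStreamTask {
    @Override
    public void process(Envelope envelope, Collector collector, Coordinator coordinator) {
        String msg = envelope.getMsg();
        if (msg == null || msg.isEmpty()) {
            return; // nothing to forward
        }
        collector.sendMsg("uppercased", envelope.getKey(), msg.toUpperCase(Locale.ROOT));
    }
}
```

The task is called once per incoming message; it reads from the envelope, writes through the collector, and would use the coordinator only to request a commit or shutdown.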

SLIDE 27

Samza Architecture (Logical view)

[Diagram]
Tasks: Task 1, Task 2, Task 3
Log A: partition 0, partition 1, partition 2
Log B: partition 0, partition 1

SLIDE 28

Samza Architecture (Logical view)

[Diagram]
Tasks: Task 1, Task 2, Task 3
Log A: partition 0, partition 1, partition 2
Log B: partition 0, partition 1
Containers: Samza container 1, Samza container 2
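One plausible reading of the logical view is that task i consumes partition i of every input log that has such a partition, and that tasks are then spread across containers. The sketch below assumes exactly that, plus round-robin placement of tasks on containers. `TaskAssignment` and its methods are hypothetical, and indices are 0-based, so "Task 1" on the slide corresponds to index 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of partition-to-task and task-to-container assignment.
final class TaskAssignment {
    // For each task index, lists the "log/partition" inputs it would read.
    // The number of tasks equals the largest partition count among the inputs.
    static List<List<String>> inputsPerTask(Map<String, Integer> partitionsPerLog) {
        int numTasks = 0;
        for (int n : partitionsPerLog.values()) {
            numTasks = Math.max(numTasks, n);
        }
        List<List<String>> tasks = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            List<String> inputs = new ArrayList<>();
            for (Map.Entry<String, Integer> e : partitionsPerLog.entrySet()) {
                if (t < e.getValue()) {
                    inputs.add(e.getKey() + "/" + t);
                }
            }
            tasks.add(inputs);
        }
        return tasks;
    }

    // Assumed placement policy: round-robin over the available containers.
    static int containerFor(int taskIndex, int numContainers) {
        return taskIndex % numContainers;
    }
}
```

For a 3-partition Log A and a 2-partition Log B this yields three tasks, the last of which reads only from Log A.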

SLIDE 29

Samza Architecture (Physical view)

[Diagram labels: Samza container 1, Samza container 2, Host 1, Host 2]

SLIDE 30

Samza Architecture (Physical view)

[Diagram labels: Samza container 1, Samza container 2, Host 1, Host 2, Samza YARN AM, Node manager (×2)]

SLIDE 31

Samza Architecture (Physical view)

[Diagram labels: Samza container 1, Samza container 2, Host 1, Host 2, Samza YARN AM, Node manager (×2), Kafka (×2)]

SLIDE 32

Samza Architecture: Equivalence to Map Reduce

[Diagram labels: Map Reduce (×2), YARN AM, Node manager (×2), HDFS (×2), Host 1, Host 2]

SLIDE 33

M/R Operation Primitives

• Filter: records matching some condition
• Map: record = f(record)
• Join: two or more datasets by key
• Group: records with same key
• Aggregate: f(records within the same group)
• Pipe: job 1’s output => job 2’s input

SLIDE 34

M/R Operation Primitives on streams

• Filter: records matching some condition
• Map: record = f(record)
• Join: two or more datasets by key
• Group: records with same key
• Aggregate: f(records within the same group)
• Pipe: job 1’s output => job 2’s input

Requires state maintenance
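Filter and map can handle each record in isolation, but on an unbounded stream an operation such as group-then-aggregate has to remember something between records. Assuming that is what the state-maintenance note refers to, here is a sketch of a running count per key. `RunningCount` is a hypothetical name and the state is a plain in-memory map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stateful operator: counts records per key as they arrive.
final class RunningCount {
    // The state that must survive between calls to process().
    private final Map<String, Long> countsByKey = new HashMap<>();

    // Processes one record and returns the updated count for its key.
    long process(String key) {
        return countsByKey.merge(key, 1L, Long::sum);
    }
}
```

Because the stream never ends, there is no final "reduce" step as in a batch job; the count is updated and can be emitted after every record.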

SLIDE 35

Agenda

• Real-time Data Integration
• Introduction to Logs & Apache Kafka
• Logs & Stream processing
• Apache Samza
• Stateful stream processing (current section)

SLIDE 36

Example: Newsfeed

[Diagram]
Status update log: User 567 posted "Hello World"; User 989 posted "Blah Blah"; User ... posted "..."
Job: Fan out messages to followers, using follower lists from an external connection DB
  567 -> [123, 679, 789, ...]
  999 -> [156, 343, ... ]
Push notification log: Refresh user 123's newsfeed; Refresh user 679's newsfeed; Refresh user ...'s newsfeed

SLIDE 37

Local state vs Remote state: Remote

[Diagram: Samza task (partition 0) and Samza task (partition 1), each processing 100-500K msg/sec/node, query remote state on disk (ex: Cassandra, MongoDB, etc) that serves 1-5K queries/sec ??]

❌ Performance
❌ Isolation
❌ Limited APIs
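The performance problem follows from the numbers on the slide. Assuming every processed message triggers one remote query, a single task already asks for far more than the store can serve. `RemoteStateLoad` is a hypothetical helper that just makes the arithmetic explicit.

```java
// Hypothetical helper: how many times over its capacity a remote store is
// pushed if each processed message causes exactly one remote query.
final class RemoteStateLoad {
    static double overloadFactor(long messagesPerSec, long storeQueriesPerSec) {
        return (double) messagesPerSec / storeQueriesPerSec;
    }
}
```

In the best case on the slide (100K msg/sec against 5K queries/sec) the factor is 20; in the worst case (500K msg/sec against 1K queries/sec) it is 500.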

SLIDE 38

Local state: Bring data closer to computation

[Diagram: Samza task (partition 0) and Samza task (partition 1), each with its own local LevelDB/RocksDB store]

SLIDE 39

Local state: Bring data closer to computation

[Diagram: Samza task (partition 0) and Samza task (partition 1), each with its own local LevelDB/RocksDB store. Additional labels: Disk, Change log stream]

SLIDE 40

Example Revisited: Newsfeed

[Diagram]
Status update log: User 567 posted "Hello World"; User 989 posted "Blah Blah"; User ... posted "..."
New connection log: User 123 followed 567; User 890 followed 234; User ... followed ...
Job: Fan out messages to followers, using the follower lists
  567 -> [123, 679, 789, ...]
  999 -> [156, 343, ... ]
Push notification log: Refresh user 123's newsfeed; Refresh user 679's newsfeed; Refresh user ...'s newsfeed
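In the revisited example the follower lists are built up from the new connection log instead of being fetched from an external database. A sketch of that pattern, with an in-memory map standing in for the task's local store; `NewsfeedFanOut`, `onFollow` and `onStatusUpdate` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical fan-out job that keeps follower lists as local state.
final class NewsfeedFanOut {
    // Local state: followed user id -> ids of that user's followers.
    private final Map<Integer, Set<Integer>> followersByUser = new HashMap<>();

    // Handles a record from the new connection log ("User X followed Y").
    void onFollow(int follower, int followed) {
        followersByUser.computeIfAbsent(followed, k -> new LinkedHashSet<>()).add(follower);
    }

    // Handles a record from the status update log and returns the messages
    // to append to the push notification log, one per follower.
    List<String> onStatusUpdate(int author) {
        List<String> notifications = new ArrayList<>();
        for (int follower : followersByUser.getOrDefault(author, Collections.emptySet())) {
            notifications.add("Refresh user " + follower + "'s newsfeed");
        }
        return notifications;
    }
}
```

Every status update is served by a local map lookup, so the job no longer issues a remote query per message.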

SLIDE 41

Fault tolerance?

[Diagram labels: Samza container 1, Samza container 2, Host 1, Host 2, Samza YARN AM, Node manager (×2), Kafka (×2)]

SLIDE 42

Fault tolerance in Samza

[Diagram: Samza task (partition 0) and Samza task (partition 1), each with its own local LevelDB/RocksDB store, backed by a durable change log]
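The idea behind a durable change log is that every write to the local store is also appended to a log, so a restarted task can rebuild its state by replaying that log. A sketch, assuming an in-memory map for the local store and an in-memory list for the change log; `ChangelogBackedStore` is a hypothetical name, not Samza's store API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical key-value store whose writes are mirrored to a change log.
final class ChangelogBackedStore {
    private final Map<String, String> local = new HashMap<>();
    private final List<String[]> changelog; // each entry is {key, value}

    // Restores local state by replaying the change log from the beginning.
    ChangelogBackedStore(List<String[]> changelog) {
        this.changelog = changelog;
        for (String[] entry : changelog) {
            local.put(entry[0], entry[1]);
        }
    }

    // Writes locally and appends the same change to the log.
    void put(String key, String value) {
        local.put(key, value);
        changelog.add(new String[] {key, value});
    }

    // Reads are served entirely from local state.
    String get(String key) {
        return local.get(key);
    }
}
```

If the host running a task is lost, constructing a new store over the same change log on another host yields the same key-value contents.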

SLIDE 43

Slow jobs

[Diagram labels: Log A, Job 1, Job 2, Log B, Log C, Log D, Log E]

❌ Drop data
❌ Backpressure
❌ Queue
  ❌ In memory
  ✅ On disk (KAFKA)

SLIDE 44

Summary

• Real-time data integration is crucial for the success and adoption of stream processing
• Logs form the basis for real-time data integration
• Stream processing = f(logs)
• Samza is designed from the ground up for scalability and provides fault-tolerant, persistent state

SLIDE 45

Thank you!

• The Log
  ◦ http://bit.ly/the_log
• Apache Kafka
  ◦ http://kafka.apache.org
• Apache Samza
  ◦ http://samza.incubator.apache.org
• Me
  ◦ @nehanarkhede
  ◦ http://www.linkedin.com/in/nehanarkhede