Apache Kafka, Stream Processing, and Real-time Data
Jay Kreps
The Plan
- 1. What is Data Integration?
- 2. What is Apache Kafka?
- 3. Logs and Distributed Systems
- 4. Logs and Data Integration
- 5. Logs and Stream Processing
Data Integration
Maslow’s Hierarchy
- Physiological
- Safety
- Love & Belonging
- Esteem
- Self-Actualization
For Data
- Acquisition/Collection
- Semantics
- Understanding
- Automation
New Types of Data
- Database data
– Users, products, orders, etc
- Events
– Clicks, Impressions, Pageviews, etc
- Application metrics
– CPU usage, requests/sec
- Application logs
– Service calls, errors
New Types of Systems
- Live Stores
– Voldemort
– Espresso
– Graph
– OLAP
– Search
– InGraphs
- Offline
– Hadoop
– Teradata
Bad
[Diagram, without a central log: Oracle (x3), Voldemort (x3), Espresso (x3), User Tracking, Operational Logs, Operational Metrics, Production Services ..., Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Email, Security]
Good
[Diagram, with a central Log: Oracle (x3), Voldemort (x3), Espresso (x3), User Tracking, Operational Logs, Operational Metrics, Production Services ..., Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Email, Security]
Apache Kafka
[Diagram: many producers writing to a Kafka cluster and many consumers reading from it]
A brief history of Kafka
Three design principles
- 1. One pipeline to rule them all
- 2. Stream processing >> messaging
- 3. Clusters not servers
Characteristics
- Scalability of a filesystem
– Hundreds of MB/sec/server throughput
– Many TB per server
- Guarantees of a database
– Messages strictly ordered
– All data persistent
- Distributed by default
– Replication
– Partitioning model
Kafka At LinkedIn
- 175 TB of in-flight log data per colo
- Low-latency: ~1.5 ms
- Replicated to each datacenter
- Tens of thousands of data producers
- Thousands of consumers
- 7 million messages written/sec
- 35 million messages read/sec
- Hadoop integration
Kafka is about logs
What is a log?
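[Diagram: a log drawn as an ordered sequence of records numbered 1 to 11, with "1st Record" marking the oldest entry and "Next Record Written" marking position 12]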
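The log pictured on this slide can be sketched as a very small data structure. This is a minimal in-memory illustration, not Kafka's implementation; the `Log` class and its method names are hypothetical, and it numbers records from 0 rather than 1 purely for convenience.

```python
class Log:
    """Append-only sequence of records (hypothetical sketch, not Kafka's API)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=None):
        """Return records starting at `offset`, oldest first."""
        end = None if max_records is None else offset + max_records
        return self._records[offset:end]

    def next_offset(self):
        """Offset that the next appended record will receive."""
        return len(self._records)


log = Log()
# Eleven records, as in the diagram; offsets here are 0-based (an assumption).
offsets = [log.append(f"record-{i}") for i in range(1, 12)]
```

Records are never modified in place; the only write operation is an append, which is what gives every record a stable position.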
Partitioning
[Diagram: three logs labeled Partition 0, Partition 1, and Partition 2, each with its own independently numbered sequence of records]
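One way to picture partitioning is as several independent logs, with a record's key deciding which one it lands in. The sketch below assumes a CRC32 hash of the key modulo the partition count; that choice, and the `PartitionedLog` name, are illustrative assumptions and not Kafka's own partitioner.

```python
import zlib


class PartitionedLog:
    """Several independent append-only logs (hypothetical sketch)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumed scheme: stable hash of the key, modulo the partition count.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append to the key's partition; return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


plog = PartitionedLog(3)
users = ["alice", "bob", "alice", "carol", "alice"]
placements = [plog.append(user, f"click-{i}") for i, user in enumerate(users)]
```

Because the same key always maps to the same partition, ordering is preserved per key even though there is no single order across partitions.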
Logs: pub/sub done right
[Diagram: a Data Source writes to the Log; Destination System A reads at time = 7 and Destination System B reads at time = 11]
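The diagram's point, that each destination system tracks its own position in the same log, can be sketched as follows. This is an in-memory illustration only; `publish`, `Subscriber`, and `poll` are hypothetical names, not Kafka client calls.

```python
log = []  # a plain list standing in for the shared log


def publish(record):
    """The data source appends; it knows nothing about who reads."""
    log.append(record)


class Subscriber:
    """A destination system that remembers how far it has read."""

    def __init__(self, name):
        self.name = name
        self.position = 0  # number of records consumed so far

    def poll(self, max_records):
        """Read up to max_records from the current position and advance."""
        batch = log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


for i in range(1, 13):
    publish(f"event-{i}")

system_a = Subscriber("A")
system_b = Subscriber("B")
system_a.poll(7)    # A has processed 7 records
system_b.poll(11)   # B has processed 11 records
remaining_for_a = system_a.poll(100)  # A catches up at its own pace
```

A slow reader does not hold back a fast one, and neither affects the writer, since the only per-subscriber state is a position.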
Logs And Distributed Systems
Example: A Fault-tolerant CEO Hash Table
Operations
PUT('microsoft', 'bill gates')
PUT('apple', 'steve jobs')
PUT('microsoft', 'steve ballmer')
PUT('google', 'larry page')
PUT('yahoo', 'terry semel')
PUT('google', 'eric schmidt')
PUT('yahoo', 'jerry yang')
PUT('yahoo', 'carol bartz')
PUT('apple', 'tim cook')
PUT('google', 'larry page')
PUT('yahoo', 'scott thompson')
PUT('yahoo', 'marissa mayer')
PUT('microsoft', 'satya nadella')
Final State
{ 'microsoft': 'satya nadella', 'apple': 'tim cook', 'google': 'larry page', 'yahoo': 'marissa mayer' }
[Diagram: the same PUT operations written as numbered log entries, with Replica 1 at offset=10 and Replica 2 at offset=12]
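The CEO hash table example can be replayed in a few lines. The sketch below assumes that a replica's offset means "number of log entries applied so far"; the `replay` helper is a hypothetical name introduced for illustration.

```python
# The PUT operations from the slide, in log order.
OPERATIONS = [
    ("microsoft", "bill gates"),
    ("apple", "steve jobs"),
    ("microsoft", "steve ballmer"),
    ("google", "larry page"),
    ("yahoo", "terry semel"),
    ("google", "eric schmidt"),
    ("yahoo", "jerry yang"),
    ("yahoo", "carol bartz"),
    ("apple", "tim cook"),
    ("google", "larry page"),
    ("yahoo", "scott thompson"),
    ("yahoo", "marissa mayer"),
    ("microsoft", "satya nadella"),
]


def replay(log, upto):
    """Rebuild a replica's hash table by applying the first `upto` entries."""
    state = {}
    for key, value in log[:upto]:
        state[key] = value  # PUT overwrites any earlier value for the key
    return state


replica_1 = replay(OPERATIONS, 10)               # lagging replica
replica_2 = replay(OPERATIONS, 12)               # slightly less stale
caught_up = replay(OPERATIONS, len(OPERATIONS))  # has read the whole log
```

Two replicas at different offsets disagree only because one has read less of the log; once both apply the same entries in the same order they hold the same state.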
Two System Design Styles
- State-machine Replication
– [Diagram: Writes go into The Log; three Peers read from it]
- Primary-backup
– [Diagram: Requests go to the Master, which writes state changes to The Log; two Slaves read from it]
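The difference between the two styles comes down to what is written to the log. The sketch below uses a hypothetical arithmetic service (not from the slides) as the replicated state: in state-machine replication the log holds the incoming requests and every peer applies them, while in primary-backup the master applies each request and logs the resulting state for the others to copy.

```python
def apply(value, request):
    """Apply one request to the current value (hypothetical service)."""
    op, operand = request
    if op == "add":
        return value + operand
    if op == "mul":
        return value * operand
    raise ValueError(f"unknown operation: {op}")


requests = [("add", 1), ("mul", 2), ("add", 3)]

# State-machine replication: the log stores the requests themselves.
request_log = list(requests)


def run_peer(log):
    """Each peer starts from the same state and applies every request."""
    value = 0
    for request in log:
        value = apply(value, request)
    return value


peer_values = [run_peer(request_log) for _ in range(3)]

# Primary-backup: the master applies requests and logs the state changes.
state_log = []
master_value = 0
for request in requests:
    master_value = apply(master_value, request)
    state_log.append(master_value)

# The other replicas just adopt the latest logged state.
backup_values = [state_log[-1] for _ in range(2)]
```

The first style requires the processing to be deterministic on every peer; the second only requires that the replicas copy what the master wrote.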
Example: User views job
[Diagram: the Jobs Frontend sends Job Views to Kafka; Hadoop, Security, Job Poster Analytics, Rec. Engine, and Monitoring each receive Job Views from Kafka]
It’s all one big distributed system
[Diagram, with a central Log: Oracle (x3), Voldemort (x3), Espresso (x3), User Tracking, Operational Logs, Operational Metrics, Production Services ..., Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Email, Security]
Comparing Data Transfer Mechanisms
Stream Processing
Stream Processing = Logs + Jobs
[Diagram: Job 1, Job 2, and Job 3 reading from and writing to Log A through Log F]
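"Logs + Jobs" can be read as: a job consumes one or more input logs and appends its results to an output log, which other jobs may consume in turn. The sketch below is a minimal in-memory illustration under that reading; `run_job`, the two transforms, and the event fields are hypothetical and are not Samza's API.

```python
def run_job(transform, input_log, output_log, start=0):
    """Apply `transform` to input records from `start` onward, appending
    every result to `output_log`. Returns the new input position."""
    for record in input_log[start:]:
        for result in transform(record):
            output_log.append(result)
    return len(input_log)


def only_job_views(event):
    """Job 1: keep page views that are job views (assumed URL scheme)."""
    if event["page"].startswith("/jobs/"):
        yield event


def to_job_id(event):
    """Job 2: reduce each job view to the id of the job viewed."""
    yield event["page"].rsplit("/", 1)[-1]


log_a = [
    {"user": "u1", "page": "/jobs/42"},
    {"user": "u2", "page": "/home"},
    {"user": "u3", "page": "/jobs/7"},
]
log_b, log_c = [], []

pos_a = run_job(only_job_views, log_a, log_b)  # Log A -> Job 1 -> Log B
pos_b = run_job(to_job_id, log_b, log_c)       # Log B -> Job 2 -> Log C

# New input arrives later; each job resumes from its saved position.
log_a.append({"user": "u4", "page": "/jobs/9"})
pos_a = run_job(only_job_views, log_a, log_b, start=pos_a)
pos_b = run_job(to_job_id, log_b, log_c, start=pos_b)
```

Because a job's only state here is its position in the input log, running it over a bounded range of the log is batch processing and running it continuously is stream processing.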
Stream processing is a generalization of batch processing
Examples
- Monitoring
- Security
- Content processing
- Recommendations
- Newsfeed
- ETL
Systems Can Help
Samza Architecture
[Diagram: four Jobs running on Samza, which runs on Kafka and YARN]
Log-centric Architecture
[Diagram: a central Log connected to a Key-Value Query Layer, a Search Query Layer, Stream Processing, Hadoop, Monitoring & Graphs, and Graph DB, OLAP Store, etc.]
- Kafka
http://kafka.apache.org
- Samza
http://samza.incubator.apache.org
- Log Blog
http://linkd.in/199iMwY
- Me
http://www.linkedin.com/in/jaykreps
@jaykreps