I ♥ Logs: Apache Kafka, Stream Processing, and Real-time Data (Jay Kreps)



slide-1
SLIDE 1

Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

I ♥ Logs

slide-2
SLIDE 2

The Plan

  • 1. What is Data Integration?
  • 2. What is Apache Kafka?
  • 3. Logs and Distributed Systems
  • 4. Logs and Data Integration
  • 5. Logs and Stream Processing
slide-3
SLIDE 3

Data Integration

slide-4
SLIDE 4

Maslow’s Hierarchy

  • Physiological
  • Safety
  • Love & Belonging
  • Esteem
  • Self-Actualization

slide-5
SLIDE 5

For Data

  • Acquisition/Collection
  • Semantics
  • Understanding
  • Automation

slide-6
SLIDE 6

New Types of Data

  • Database data

– Users, products, orders, etc

  • Events

– Clicks, Impressions, Pageviews, etc

  • Application metrics

– CPU usage, requests/sec

  • Application logs

– Service calls, errors

slide-7
SLIDE 7

New Types of Systems

  • Live Stores

– Voldemort
– Espresso
– Graph
– OLAP
– Search
– InGraphs

  • Offline

– Hadoop
– Teradata

slide-8
SLIDE 8

Bad

Diagram: point-to-point pipelines among Oracle, Voldemort, Espresso, User Tracking, Operational Logs, Operational Metrics, Production Services, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, and Email.

slide-9
SLIDE 9

Good

Diagram: a central Log connecting Oracle, Voldemort, Espresso, User Tracking, Operational Logs, Operational Metrics, Production Services, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, and Email.

slide-10
SLIDE 10

The Plan

  • 1. What is Data Integration?
  • 2. What is Apache Kafka?
  • 3. Logs and Distributed Systems
  • 4. Logs and Data Integration
  • 5. Logs and Stream Processing
slide-11
SLIDE 11

Apache Kafka

Diagram: many producers write to a Kafka cluster; many consumers read from it.

slide-12
SLIDE 12
slide-13
SLIDE 13

A brief history of Kafka

slide-14
SLIDE 14

Three design principles

  • 1. One pipeline to rule them all
  • 2. Stream processing >> messaging
  • 3. Clusters not servers
slide-15
SLIDE 15

Characteristics

  • Scalability of a filesystem

– Hundreds of MB/sec/server throughput
– Many TB per server

  • Guarantees of a database

– Messages strictly ordered
– All data persistent

  • Distributed by default

– Replication
– Partitioning model

slide-16
SLIDE 16

Kafka At LinkedIn

  • 175 TB of in-flight log data per colo
  • Low-latency: ~1.5 ms
  • Replicated to each datacenter
  • Tens of thousands of data producers
  • Thousands of consumers
  • 7 million messages written/sec
  • 35 million messages read/sec
  • Hadoop integration
slide-17
SLIDE 17

The Plan

  • 1. What is Data Integration?
  • 2. What is Apache Kafka?
  • 3. Logs and Distributed Systems
  • 4. Logs and Data Integration
  • 5. Logs and Stream Processing
slide-18
SLIDE 18

Kafka is about logs

slide-19
SLIDE 19

What is a log?

slide-20
SLIDE 20
slide-21
SLIDE 21

Diagram: a log drawn as a row of records numbered in the order they were appended, starting at "1st Record" and ending at "Next Record Written".
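
The picture above can be sketched as a tiny in-memory structure. This is only an illustration of the idea, not Kafka's implementation: the `Log` class and its method names are made up for this example, and offsets here start at 0 for convenience, whereas the slide numbers records from 1.

```python
class Log:
    """Append-only, totally ordered sequence of records (illustrative only)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=None):
        """Return records starting at `offset`, oldest first."""
        end = None if max_records is None else offset + max_records
        return self._records[offset:end]

    def next_offset(self):
        """Offset the next appended record will receive."""
        return len(self._records)


log = Log()
first = log.append("first record")
second = log.append("second record")
```

Records are never modified in place; the only write is an append, so the offset doubles as a logical timestamp.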

slide-22
SLIDE 22

Partitioning

Diagram: three separate logs labeled Partition 0, Partition 1, and Partition 2, each with its own record numbering.
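
A minimal sketch of partitioning, assuming records carry a key and are routed by a hash of that key. The slide does not say how records are assigned to partitions, so the routing rule, the `crc32` hash, and the function names are all assumptions made for illustration.

```python
import zlib

NUM_PARTITIONS = 3

# One independent append-only list per partition.
partitions = [[] for _ in range(NUM_PARTITIONS)]


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Pick a partition from a stable hash of the key (crc32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append(key, value):
    """Append to the key's partition; return (partition, offset in partition)."""
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1


events = [("alice", "/home"), ("bob", "/jobs"), ("alice", "/jobs"), ("alice", "/profile")]
for user, page in events:
    append(user, page)
```

Ordering is kept within each partition, so all records for one key stay in order, but there is no ordering between partitions.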

slide-23
SLIDE 23

Logs: pub/sub done right

Diagram: a Data Source writes to the Log; Destination System A (time = 7) and Destination System B (time = 11) each read from it at their own position in the Log.
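
A small sketch of the idea in this diagram: each destination tracks its own read position, so a slow reader does not hold back a fast one. The `Consumer` class and the record names are hypothetical, and "time = 7" is read here as "has consumed 7 records".

```python
class Consumer:
    """Reads a shared log at its own pace (illustrative only)."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.position = 0  # number of records consumed so far

    def poll(self, max_records):
        """Return up to max_records unread records and advance the position."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


# The data source appends twelve records to a shared log.
log = []
for i in range(1, 13):
    log.append(f"record-{i}")

system_a = Consumer("Destination System A", log)
system_b = Consumer("Destination System B", log)
system_a.poll(7)   # A has caught up to position 7
system_b.poll(11)  # B is further along, at position 11
```

The log itself holds no per-subscriber state; adding another destination is just another position into the same sequence.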

slide-24
SLIDE 24

Logs And Distributed Systems

slide-25
SLIDE 25

Example: A Fault-tolerant CEO Hash Table

slide-26
SLIDE 26

Replica 1 Replica 2

Operations:

PUT('microsoft', 'bill gates')
PUT('apple', 'steve jobs')
PUT('microsoft', 'steve ballmer')
PUT('google', 'larry page')
PUT('yahoo', 'terry semel')
PUT('google', 'eric schmidt')
PUT('yahoo', 'jerry yang')
PUT('yahoo', 'carol bartz')
PUT('apple', 'tim cook')
PUT('google', 'larry page')
PUT('yahoo', 'scott thompson')
PUT('yahoo', 'marissa mayer')
PUT('microsoft', 'satya nadella')

Final State:

{ 'microsoft': 'satya nadella', 'apple': 'tim cook', 'google': 'larry page', 'yahoo': 'marissa mayer' }

slide-27
SLIDE 27

Diagram: the same PUT operations laid out as a numbered log, with Replica 1 (offset=10) and Replica 2 (offset=12) marked at their current positions.
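
The CEO hash table example can be replayed in a few lines. This sketch assumes that "offset=N" means a replica has applied the first N operations in the log; the slide does not spell that out, and the `replay` helper is made up for illustration.

```python
# The PUT operations from the slides, in log order.
OPERATIONS = [
    ("microsoft", "bill gates"),
    ("apple", "steve jobs"),
    ("microsoft", "steve ballmer"),
    ("google", "larry page"),
    ("yahoo", "terry semel"),
    ("google", "eric schmidt"),
    ("yahoo", "jerry yang"),
    ("yahoo", "carol bartz"),
    ("apple", "tim cook"),
    ("google", "larry page"),
    ("yahoo", "scott thompson"),
    ("yahoo", "marissa mayer"),
    ("microsoft", "satya nadella"),
]


def replay(log, up_to):
    """Rebuild the hash table by applying the first `up_to` operations in order."""
    state = {}
    for key, value in log[:up_to]:
        state[key] = value
    return state


replica_1 = replay(OPERATIONS, 10)               # behind: still sees older CEOs
replica_2 = replay(OPERATIONS, 12)               # closer to the head of the log
caught_up = replay(OPERATIONS, len(OPERATIONS))  # matches the Final State slide
```

Because every replica applies the same operations in the same order, the offset alone describes exactly which state a replica is in, and a lagging replica catches up simply by reading further.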

slide-28
SLIDE 28

Two System Design Styles

  • State-machine Replication (diagram: Writes go into The Log; three Peers perform Reads from it)
  • Primary-backup (diagram: Requests go to the Master, which sends state changes through The Log to two Slaves)
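
A toy contrast between the two styles, following the usual meaning of the terms rather than anything stated on the slide: in state-machine replication the log holds the requests and every peer executes them, while in primary-backup the master executes requests and the log holds the resulting state changes. The arithmetic "service" and all function names are invented for this sketch.

```python
def apply_request(value, request):
    """Execute one request against the current value."""
    op, arg = request
    if op == "add":
        return value + arg
    if op == "mul":
        return value * arg
    raise ValueError(f"unknown operation: {op}")


requests = [("add", 1), ("mul", 2), ("add", 3)]

# State-machine replication: the log is the list of requests themselves,
# and every peer runs the same deterministic code over it.
def run_peer(request_log):
    value = 0
    for request in request_log:
        value = apply_request(value, request)
    return value

peers = [run_peer(requests) for _ in range(3)]

# Primary-backup: only the master executes requests; the log carries
# the resulting state, which the slaves simply copy.
def run_master(incoming):
    value = 0
    change_log = []
    for request in incoming:
        value = apply_request(value, request)
        change_log.append(value)  # log the new state, not the request
    return value, change_log

def run_slave(change_log):
    value = 0
    for new_value in change_log:
        value = new_value
    return value

master_value, change_log = run_master(requests)
slaves = [run_slave(change_log) for _ in range(2)]
```

Both approaches converge on the same value; they differ in what is written to the log and in who does the computation.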

slide-29
SLIDE 29

The Plan

  • 1. What is Data Integration?
  • 2. What is Apache Kafka?
  • 3. Logs and Distributed Systems
  • 4. Logs and Data Integration
  • 5. Logs and Stream Processing
slide-30
SLIDE 30

Example: User views job

Diagram: the Jobs Frontend publishes Job Views to Kafka; Hadoop, Security, Job Poster Analytics, Rec. Engine, and Monitoring each subscribe to Job Views.

slide-31
SLIDE 31

It’s all one big distributed system

Diagram: a central Log connecting Oracle, Voldemort, Espresso, User Tracking, Operational Logs, Operational Metrics, Production Services, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, and Email.

slide-32
SLIDE 32

Comparing Data Transfer Mechanisms

slide-33
SLIDE 33

The Plan

  • 1. What is Data Integration?
  • 2. What is Apache Kafka?
  • 3. Logs and Distributed Systems
  • 4. Logs and Data Integration
  • 5. Logs and Stream Processing
slide-34
SLIDE 34

Stream Processing

slide-35
SLIDE 35
slide-36
SLIDE 36

Stream Processing = Logs + Jobs

Diagram: Jobs 1, 2, and 3 reading from and writing to Logs A through F.
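
One way to read "Logs + Jobs": a job is just code that reads one log from a saved position and appends its results to another log, so jobs can be chained. The `run_job` helper, the sample records, and the two transforms below are hypothetical and only illustrate the shape of the idea.

```python
def run_job(transform, input_log, output_log, start=0):
    """Apply `transform` to each unread input record, appending results.

    Returns the new read position so the job can resume later.
    """
    position = start
    while position < len(input_log):
        for result in transform(input_log[position]):
            output_log.append(result)
        position += 1
    return position


def only_views(record):
    """Keep view events and strip the prefix; drop everything else."""
    if record.startswith("view:"):
        yield record.split(":", 1)[1]


def to_upper(record):
    yield record.upper()


log_a = ["view:job-1", "click:ad-9", "view:job-2"]
log_b = []
log_c = []

# Job 1 reads Log A and writes Log B; Job 2 reads Log B and writes Log C.
pos_1 = run_job(only_views, log_a, log_b)
pos_2 = run_job(to_upper, log_b, log_c)

# New input arrives later; each job resumes from where it stopped.
log_a.append("view:job-3")
pos_1 = run_job(only_views, log_a, log_b, start=pos_1)
pos_2 = run_job(to_upper, log_b, log_c, start=pos_2)
```

The output of one job is an ordinary log, so any number of downstream jobs can consume it independently.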

slide-37
SLIDE 37

Stream processing is a generalization of batch processing
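
One possible reading of this claim is that a batch is a stream that happens to end. In the sketch below, the same incremental function serves both cases; the function names and sample data are made up for illustration.

```python
import itertools


def running_counts(records):
    """Yield an updated count table after every record (stream style)."""
    counts = {}
    for record in records:
        counts[record] = counts.get(record, 0) + 1
        yield dict(counts)


def batch_counts(records):
    """Batch style: run the stream logic to the end, keep only the last result."""
    snapshot = {}
    for snapshot in running_counts(records):
        pass
    return snapshot


# Bounded input: behaves like a classic batch job.
final = batch_counts(["a", "b", "a"])


# Unbounded input: the same logic produces results as records arrive.
def endless():
    for i in itertools.count():
        yield "a" if i % 2 == 0 else "b"


first_three = list(itertools.islice(running_counts(endless()), 3))
```

The batch job is the special case where the input is finite and only the final snapshot is kept.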
slide-38
SLIDE 38

Examples

  • Monitoring
  • Security
  • Content processing
  • Recommendations
  • Newsfeed
  • ETL
slide-39
SLIDE 39

Systems Can Help

slide-40
SLIDE 40

Samza Architecture

Diagram: Jobs run on Samza, which sits on top of Kafka and YARN.

slide-41
SLIDE 41

Log-centric Architecture

Diagram: a central Log feeding a Key-Value Query Layer, a Search Query Layer, Stream Processing, Hadoop, Monitoring & Graphs, and Graph DB, OLAP Store, etc.
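
A rough sketch of the log-centric idea: the log is the single source of writes, and each query layer is a view built by reading it. The record shape, the two builder functions, and the sample data are assumptions for illustration only.

```python
# The shared log: every change to a job posting, in order.
log = [
    {"id": 1, "title": "Kafka engineer"},
    {"id": 2, "title": "Search engineer"},
    {"id": 1, "title": "Senior Kafka engineer"},  # update to posting 1
]


def build_key_value_view(records):
    """Key-value query layer: latest record per id."""
    view = {}
    for record in records:
        view[record["id"]] = record
    return view


def build_search_index(records):
    """Search query layer: word -> ids, built from the latest version of each record."""
    latest = build_key_value_view(records)
    index = {}
    for record in latest.values():
        for word in record["title"].lower().split():
            index.setdefault(word, set()).add(record["id"])
    return index


kv = build_key_value_view(log)
search = build_search_index(log)
```

Each view can be dropped and rebuilt from the log at any time, which is what lets very different serving systems stay consistent with one another.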

slide-42
SLIDE 42

  • Kafka: http://kafka.apache.org
  • Samza: http://samza.incubator.apache.org
  • Log Blog: http://linkd.in/199iMwY
  • Me: http://www.linkedin.com/in/jaykreps @jaykreps