  1. Routing Trillions of Events Per Day @Twitter #ApacheBigData 2017 Lohit VijayaRenu & Gary Steelman @lohitvijayarenu @efsie

  2. In this talk
     1. Event Logs at Twitter
     2. Log Collection
     3. Log Processing
     4. Log Replication
     5. The Future
     6. Questions

  3. Overview

  4. Life of an Event
     (Diagram: Clients → Client Daemon / HTTP Endpoint → Aggregated by Category → Storage (HDFS))
     ● Clients log events specifying a Category name, e.g. ads_view, login_event, ...
     ● Events are grouped together across all clients into the Category
     ● Events are stored on the Hadoop Distributed File System, bucketed every hour into separate directories (see the path sketch below)
       ○ /logs/ads_view/2017/05/01/23
       ○ /logs/login_event/2017/05/01/23
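As a rough illustration of the hourly bucketing above, the snippet below derives the HDFS directory an event of a given category would land in. Only the /logs/<category>/yyyy/MM/dd/HH layout comes from the slide; the helper itself (and the assumption of UTC bucketing) is a sketch, not Twitter's client or daemon code.

```java
// Sketch only: derives the hourly HDFS bucket for a category, matching the
// /logs/<category>/yyyy/MM/dd/HH layout shown on the slide. UTC is an assumption.
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class EventBucket {
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // e.g. bucketFor("ads_view", t) -> "/logs/ads_view/2017/05/01/23"
    static String bucketFor(String category, ZonedDateTime eventTime) {
        return "/logs/" + category + "/"
                + HOURLY.format(eventTime.withZoneSameInstant(ZoneOffset.UTC));
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("login_event", ZonedDateTime.now(ZoneOffset.UTC)));
    }
}
```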

  5. Event Log Stats
     ● >1 Trillion events a day, incoming across millions of clients
     ● ~3 PB of data a day, uncompressed
     ● >600 categories (event groups by category)
     ● <1500 nodes, collocated with HDFS datanodes

  6. Event Log Architecture (diagram, repeated on slides 7-10 as build steps)
     ● Remote clients, clients inside the datacenter, and HTTP clients send events to a local log collection daemon
     ● Log events are aggregated, grouped by Category
     ● The Log Processor writes to Storage (HDFS) and Storage (Streaming)
     ● The Log Replicator copies data to further Storage (HDFS) clusters

  11. Event Log Architecture across datacenters (diagram, repeated on slides 12-13 as build steps)
      ● Events are generated inside DC1 and DC2
      ● Each datacenter has RT Storage (HDFS), DW Storage (HDFS), and Prod Storage (HDFS)
      ● A single Cold Storage (HDFS) cluster is also shown

  14. Collection

  15. Event Log Architecture (diagram repeated from slide 6; this section covers the log collection stage)

  16. Event Collection Overview
      ● Past: Scribe Client Daemon → Scribe Aggregator Daemons
      ● Present: Scribe Client Daemon → Flume Aggregator Daemon
      ● Future: Flume Client Daemon → Flume Aggregator Daemon

  17. Event Collection (Past): Challenges with Scribe
      ● Too many open file handles to HDFS
        ○ 600 categories x 1500 aggregators x 6 per hour ≈ 5.4M files per hour
      ● High IO wait on DataNodes at scale
      ● Max limit on throughput per aggregator
      ● Difficult to track message drops
      ● No longer active open source development

  18. Event Collection (Present): Apache Flume
      (Diagram: Flume Agent = Source → Channel → Sink → HDFS Client)
      ● Well defined interfaces
      ● Open source
      ● Concept of transactions (see the sketch below)
      ● Existing implementations of interfaces
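To make the "concept of transactions" concrete, here is a minimal sketch of a custom sink built on Flume 1.x's Sink/Channel/Transaction interfaces. It only drains the channel; the comment marks where a real deployment would hand events to the HDFS client. Twitter used the stock HDFSEventSink, not a class like this.

```java
// Minimal sketch of Flume's transaction pattern (Flume 1.x APIs). Events leave
// the channel only when the transaction commits; on failure they are rolled
// back and retried, which is what gives the pipeline its delivery guarantees.
import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;

public class SketchSink extends AbstractSink {
    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();   // null when the channel is empty
            if (event != null) {
                // A real sink would write event.getBody() to HDFS here
                // (see Flume's HDFSEventSink); this sketch just drops it.
            }
            txn.commit();
            return event == null ? Status.BACKOFF : Status.READY;
        } catch (Exception e) {
            txn.rollback();                 // the event stays in the channel and is retried
            return Status.BACKOFF;
        } finally {
            txn.close();
        }
    }
}
```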

  19. Event Collection (Present): Category Group
      (Diagram: Category 1, Category 2, Category 3 from Agent 1, Agent 2, Agent 3 combined into a Category Group)
      ● Combine multiple related categories into a category group (a toy mapping follows below)
      ● Provide different properties per group
      ● Contains multiple events to generate fewer combined sequence files
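The grouping itself is just a mapping from category to category group. The sketch below hard-codes the ads_group and login_group examples used later in the deck; in practice the mapping would be configuration-driven, and the "default_group" fallback is purely hypothetical.

```java
// Sketch of a category -> category-group mapping, using the example groups
// from the demux slides (ads_group, login_group). Real mappings are config-driven.
import java.util.List;
import java.util.Map;

public class CategoryGroups {
    static final Map<String, List<String>> GROUPS = Map.of(
            "ads_group", List.of("ads_click", "ads_view"),
            "login_group", List.of("login_event"));

    // Reverse lookup: which group does a category belong to?
    static String groupOf(String category) {
        return GROUPS.entrySet().stream()
                .filter(e -> e.getValue().contains(category))
                .map(Map.Entry::getKey)
                .findFirst()
                .orElse("default_group");   // hypothetical catch-all group
    }
}
```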

  20. Event Collection (Present): Aggregator Group
      (Diagram: category groups 1 and 2 hosted on Aggregator Group 1 (Agent 1, Agent 2, Agent 3) and Aggregator Group 2 (... Agent 8))
      ● A set of aggregators hosting the same set of category groups
      ● Easy to manage a group of aggregators hosting a subset of categories

  21. Event Collection (Present): Flume features to support groups
      ● Extend Interceptor to multiplex events into groups (see the sketch below)
      ● Implement Memory Channel Group to have a separate memory channel per category group
      ● ZooKeeper registration per category group for service discovery
      ● Metrics for category groups
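A sketch of what "extend Interceptor to multiplex events into groups" could look like with the stock Flume Interceptor interface: it copies the event's category header into a category-group header that a multiplexing selector or a per-group channel can route on. The header names and the hard-coded mapping are assumptions for illustration, not Twitter's actual implementation.

```java
// Sketch of a grouping interceptor (Flume 1.x). Header names ("category",
// "category_group") and the static mapping are assumptions for illustration.
import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class CategoryGroupInterceptor implements Interceptor {
    private final Map<String, String> categoryToGroup;

    CategoryGroupInterceptor(Map<String, String> categoryToGroup) {
        this.categoryToGroup = categoryToGroup;
    }

    @Override public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String category = event.getHeaders().get("category");
        String group = categoryToGroup.getOrDefault(category, "default_group");
        event.getHeaders().put("category_group", group);   // selector/channel routes on this
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    @Override public void close() { }

    // Flume instantiates interceptors through a Builder configured from the agent config.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            // Hard-coded example mapping; a real builder would read it from configuration.
            return new CategoryGroupInterceptor(Map.of(
                    "ads_click", "ads_group",
                    "ads_view", "ads_group",
                    "login_event", "login_group"));
        }

        @Override public void configure(Context context) { }
    }
}
```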

  22. Event Collection (Present): Flume performance improvements
      ● HDFSEventSink batching increased throughput (5x), reducing spikes on the memory channel (a batched variant of the earlier sink sketch follows)
      ● Implement buffering in HDFSEventSink instead of using SpillableMemoryChannel
      ● Stream events close to network speed
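The batching change amounts to taking many events per transaction instead of one. Below is a batched variant of the earlier sink sketch; the batch size and the flush stub are assumptions, standing in for the writes the real HDFSEventSink performs.

```java
// Sketch of batched draining in a custom sink (Flume 1.x APIs). One transaction
// covers up to BATCH_SIZE events, amortizing commit overhead. The flush step is
// a stub standing in for the HDFS write that HDFSEventSink performs.
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;

public class BatchedSketchSink extends AbstractSink {
    private static final int BATCH_SIZE = 1000;   // assumed; real values are tuned per deployment

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            List<Event> batch = new ArrayList<>(BATCH_SIZE);
            Event event;
            while (batch.size() < BATCH_SIZE && (event = channel.take()) != null) {
                batch.add(event);
            }
            flush(batch);             // must succeed before the commit below
            txn.commit();
            return batch.isEmpty() ? Status.BACKOFF : Status.READY;
        } catch (Exception e) {
            txn.rollback();           // the whole batch stays in the channel and is retried
            return Status.BACKOFF;
        } finally {
            txn.close();
        }
    }

    private void flush(List<Event> batch) {
        // Stub: a real sink appends the batch contents to an HDFS file here.
    }
}
```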

  23. Processing

  24. Event Log Architecture (diagram repeated from slide 6; this section covers the Log Processor)

  25. Log Processor Stats: Processing a Trillion Events per Day
      ● 8 wall clock hours to process one day of data
      ● >1 PB of data per day: the output of cleaned, compressed, consolidated, and converted Flume sequence files
      ● 20-50% disk space saved by processing

  26. Log Processor Needs: Processing a Trillion Events per Day
      ● Make processing log data easier for analytics teams
      ● Disk space is at a premium on analytics clusters
      ● Still too many files, causing increased pressure on the NameNode
      ● Log data is read many times, and different teams all perform the same pre-processing steps on the same data sets

  27. Log Processor Steps (Datacenter 1; diagram repeated on slides 28-29 as build steps)
      ● Category groups → demux jobs → categories
      ● ads_group/yyyy/mm/dd/hh → ads_group_demuxer → ads_click/yyyy/mm/dd/hh, ads_view/yyyy/mm/dd/hh
      ● login_group/yyyy/mm/dd/hh → login_group_demuxer → login_event/yyyy/mm/dd/hh

  30. Log Processor Steps (a toy decode-and-demux sketch follows)
      1. Decode: Base64 encoding from logged data
      2. Demux: category groups into individual categories for easier consumption by analytics teams
      3. Clean: corrupt, empty, or invalid records so data sets are more reliable
      4. Compress: logged data to the highest level to save disk space, from LZO level 3 to LZO level 7
      5. Consolidate: small files to reduce pressure on the NameNode
      6. Convert: some categories into Parquet for fastest use in ad-hoc exploratory tools
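A toy version of steps 1 and 2 (decode and demux): base64-decode each (category, payload) record and route it to its category. The record shape and the in-memory accounting are simplifications; the real pipeline does this inside Tez jobs over Flume SequenceFiles and writes LZO-compressed output per category bucket.

```java
// Toy decode-and-demux: base64-decode each record and route it by category.
// The real pipeline runs this logic in Tez jobs over SequenceFiles and writes
// LZO-compressed files under /logs/<category>/yyyy/mm/dd/hh.
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyDemux {
    public static void main(String[] args) {
        // (category, base64-encoded payload) pairs, standing in for the
        // contents of an ads_group sequence file.
        List<String[]> records = List.of(
                new String[]{"ads_view", Base64.getEncoder().encodeToString("view:123".getBytes())},
                new String[]{"ads_click", Base64.getEncoder().encodeToString("click:456".getBytes())});

        Map<String, Integer> bytesPerCategory = new HashMap<>();
        for (String[] record : records) {
            String category = record[0];
            byte[] payload = Base64.getDecoder().decode(record[1]);   // step 1: decode
            // step 2: demux -- append the payload under its category's hourly bucket
            bytesPerCategory.merge(category, payload.length, Integer::sum);
        }
        System.out.println(bytesPerCategory);   // e.g. {ads_view=8, ads_click=9}
    }
}
```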

  31. Why Base64 Decoding? Legacy choices (a small round-trip example follows)
      ● Scribe's contract amounts to sending a binary blob to a port
      ● Scribe used newline characters to delimit records in a binary blob batch of records
      ● Valid records may include newline characters
      ● Scribe base64-encoded received binary blobs to avoid confusion with the record delimiter
      ● Base64 encoding is no longer necessary because we have moved to one serialized Thrift object per binary blob
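A small round-trip showing why base64 was needed with newline-delimited framing: a record that itself contains a newline would be split into two bogus records, while its base64 form contains no newline, so the batch splits cleanly. The record contents are invented for illustration.

```java
// Demonstrates the legacy framing problem: records are newline-delimited, and
// base64 keeps a record containing '\n' from being split in half.
import java.util.Base64;

public class NewlineFraming {
    public static void main(String[] args) {
        String recordWithNewline = "user=alice\naction=login";   // a valid record containing '\n'
        String otherRecord = "user=bob action=logout";

        // Naive framing: joining raw records with '\n' yields 3 pieces, not 2.
        String rawBatch = recordWithNewline + "\n" + otherRecord;
        System.out.println("raw split count: " + rawBatch.split("\n").length);        // 3

        // Base64 framing: encoded records contain no '\n', so the split is correct.
        Base64.Encoder enc = Base64.getEncoder();
        String encodedBatch = enc.encodeToString(recordWithNewline.getBytes())
                + "\n" + enc.encodeToString(otherRecord.getBytes());
        System.out.println("base64 split count: " + encodedBatch.split("\n").length); // 2
    }
}
```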

  32. Log Demux Visual (diagram repeated on slides 33-34 as build steps)
      /raw/ads_group/yyyy/mm/dd/hh/ads_group_1.seq → DEMUX → /logs/ads_view/yyyy/mm/dd/hh/1.lzo, /logs/ads_click/yyyy/mm/dd/hh/1.lzo

  35. Log Processor Daemon
      ● One log processor daemon per RT Hadoop cluster, where Flume aggregates logs
      ● Primarily responsible for demuxing category groups out of the Flume sequence files
      ● The daemon schedules Tez jobs every hour for every category group in a thread pool (see the scheduling sketch below)
      ● The daemon atomically presents processed category instances so partial data can't be read
      ● Processing proceeds according to criticality of data, or "tiers"
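The daemon's behaviour can be pictured as a fixed-rate timer that, once an hour, submits one demux task per category group to a bounded thread pool, then atomically presents the finished output so readers never see partial data. Everything in this sketch (the runTezDemux and presentAtomically stubs, the pool size, the group list) is an assumption drawn from the bullet points, not the actual daemon.

```java
// Sketch of the hourly scheduling described on the slide: one task per category
// group submitted to a thread pool, with stubbed Tez launch and atomic "present"
// steps. Names, pool sizes, and paths are illustrative assumptions.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LogProcessorDaemonSketch {
    private static final List<String> CATEGORY_GROUPS = List.of("ads_group", "login_group");

    private final ScheduledExecutorService clock = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);   // assumed pool size

    void start() {
        // Fire once an hour; each category group gets its own task in the pool.
        clock.scheduleAtFixedRate(() -> {
            for (String group : CATEGORY_GROUPS) {
                workers.submit(() -> processHour(group));
            }
        }, 0, 1, TimeUnit.HOURS);
    }

    private void processHour(String group) {
        // Stub: launch the Tez demux job for this group's latest complete hour,
        // writing into a temporary directory first.
        runTezDemux(group);
        // Stub: atomically present the result, e.g. by renaming the temporary
        // directory to /logs/<category>/yyyy/mm/dd/hh so readers never see partial data.
        presentAtomically(group);
    }

    private void runTezDemux(String group) { /* hypothetical */ }

    private void presentAtomically(String group) { /* hypothetical */ }

    public static void main(String[] args) {
        new LogProcessorDaemonSketch().start();
    }
}
```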

  36. Why Tez?
      ● Some categories are significantly larger than others (KBs vs TBs)
      ● MapReduce demux? Each reducer handles a single category
      ● Streaming demux? Each spout or channel handles a single category
      ● Massive skew in partitioning by category causes long-running tasks, which slows down job completion time
      ● Relatively well understood fault tolerance semantics, similar to MapReduce, Spark, etc.
