SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics


  1. SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics
     Qin Liu, John C.S. Lui (1), Cheng He, Lujia Pan, Wei Fan, Yunlong Shi (2)
     (1) The Chinese University of Hong Kong; (2) Huawei Noah's Ark Lab

  2. Introduction

  3. Motivation
     Network traffic arrives in a streaming fashion and should be processed in real time. For example:
     1. Network traffic classification
     2. Anomaly detection
     3. Policy and charging control in cellular networks
     4. Recommendations based on user behavior

  4. Challenges
     1. A stream processing system must sustain high-speed network traffic in cellular core networks
     ◮ existing systems: S4 [Neumeyer'10], Storm (1), ...
     ◮ implemented in Java: heavy processing overheads
     ◮ cannot sustain high-speed network traffic
     (1) http://storm.incubator.apache.org/

  5. Challenges (cont.)
     2. For critical applications, it is necessary to provide correct results after failure recovery
     ◮ replication incurs a high hardware cost
     ◮ upstream backup cannot provide "correct results" after failure recovery
     ◮ at-least-once vs. exactly-once semantics

  6. Contributions
     Design and implement SAND in C++:
     • high performance on network traffic
     • a new fault-tolerance scheme

  7. Background

  8. Background
     Continuous operator model:
     • Each node runs an operator with in-memory mutable state
     • For each input event, the state is updated and new events are sent out
     Mutable state is lost if a node fails (see the sketch below).
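
To make the failure mode concrete, here is a minimal, illustrative C++ sketch of a continuous operator (not SAND's actual API; all names are hypothetical): the mutable state lives only in process memory, so a crash loses it.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative continuous operator: per-event update of in-memory
// mutable state, emitting a new event downstream after each update.
class CountOperator {
public:
    using Emit = std::function<void(const std::string&, uint64_t)>;

    explicit CountOperator(Emit emit) : emit_(std::move(emit)) {}

    // Called once per input event (e.g., a classified packet).
    void onEvent(const std::string& key) {
        uint64_t c = ++state_[key];  // update mutable state
        emit_(key, c);               // send a new event downstream
    }

private:
    std::unordered_map<std::string, uint64_t> state_;  // lost on crash
    Emit emit_;
};
```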

  9. Example: AppTracker
     • AppTracker: traffic classification for cellular network traffic
     • Outputs the traffic distribution in real time:

     Application    Distribution
     HTTP           15.60%
     Sina Weibo      4.13%
     QQ              2.56%
     DNS             2.34%
     HTTP in QQ      2.17%

  10. Example: AppTracker
     Under the continuous operator model (a toy sketch follows):
     • Spout: captures packets from the cellular network
     • Decoder: extracts IP packets from raw packets
     • DPI-Engine: performs deep packet inspection on packets
     • Tracker: tracks the distribution of application-level protocols (HTTP, P2P, Skype, ...)
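
Below is a toy single-process sketch of how these four stages compose. The types and the placeholder decode/classify logic are hypothetical; in SAND each stage runs as a separate operator fed by real packet capture.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical wiring of the AppTracker pipeline:
// Spout -> Decoder -> DPI-Engine -> Tracker.
struct RawFrame { std::vector<uint8_t> bytes; };
struct IpPacket { std::vector<uint8_t> payload; };

IpPacket decode(const RawFrame& f) {
    return IpPacket{f.bytes};  // placeholder: would strip the link layer
}

std::string classify(const IpPacket& p) {
    return p.payload.empty() ? "UNKNOWN" : "HTTP";  // placeholder DPI
}

int main() {
    std::unordered_map<std::string, uint64_t> tracker;  // protocol counts
    std::vector<RawFrame> spout = {RawFrame{{0x47}}, RawFrame{{}}};
    for (const auto& f : spout)
        ++tracker[classify(decode(f))];  // Decoder -> DPI -> Tracker
    for (const auto& [proto, n] : tracker)
        std::cout << proto << ": " << n << "\n";
}
```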

  11. System Design

  12. Architecture of SAND
     One coordinator and multiple workers. Each worker can be seen as an operator.

  13. Coordinator
     The coordinator is responsible for:
     • managing worker executions
     • detecting worker failures
     • relaying control messages among workers
     • monitoring performance statistics
     A ZooKeeper cluster provides fault tolerance and a reliable coordination service.

  14. Worker
     Each worker contains 3 types of processes:
     • The dispatcher decodes streams and distributes them to multiple analyzers
     • Each analyzer independently processes its assigned streams
     • The collector aggregates the intermediate results from all analyzers
     The container daemon:
     • spawns or stops these processes
     • communicates with the coordinator

  15. Communication Channels
     Efficient communication channels (a minimal sketch follows):
     • Intra-worker: a lock-free shared-memory ring buffer
     • Inter-worker: ZeroMQ, a socket library optimized for clustered products
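
For illustration, here is a minimal single-producer/single-consumer lock-free ring buffer in the spirit of the intra-worker channel. SAND's real channel lives in shared memory between processes; this simplified version only works between threads of one process, and the class name is an assumption.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal SPSC lock-free ring buffer. N must be a power of two.
// Head/tail indices grow monotonically; masking maps them to slots.
template <typename T, size_t N>
class SpscRing {
public:
    bool push(const T& v) {                       // producer side only
        size_t head = head_.load(std::memory_order_relaxed);
        size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;       // buffer full
        buf_[head & (N - 1)] = v;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {                      // consumer side only
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return std::nullopt;    // buffer empty
        T v = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return v;
    }

private:
    T buf_[N];
    std::atomic<size_t> head_{0};  // written by producer
    std::atomic<size_t> tail_{0};  // written by consumer
};
```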

  16. Fault-Tolerance

  17. Previous Fault-Tolerance Schemes
     1. Replication: each operator has a replica operator [Hwang'05, Shah'04, Balazinska'08]
     ◮ Data streams are processed twice by two identical nodes
     ◮ Synchronization protocols ensure the exact ordering of events in both nodes
     ◮ On failure, the system switches over to the replica nodes
     Drawback: 2x hardware cost.

  18. Previous Fault-Tolerance Schemes
     2. Upstream backup with checkpoints [Fernandez'03, Gu'09]:
     ◮ Each node keeps a backup of the events it has forwarded since the last checkpoint
     ◮ On failure, upstream nodes replay the backup events serially to the failover node to recreate its state
     Lower hardware cost, but it is hard to provide correct results after recovery.

  19. Why is it hard?
     • Stateful continuous operators tightly integrate "computation" with "mutable state"
     • This makes it harder to define clear boundaries when computation and state can be moved around

  20. Checkpointing
     • Need to coordinate the checkpointing operation on each worker
     • 1985: Chandy and Lamport invented an asynchronous snapshot algorithm for distributed systems
     • A variant of this algorithm is implemented within SAND

  21. Checkpointing Protocol
     • The coordinator initiates a global checkpoint by sending markers to all source workers
     • For each worker w:
     ◮ on receiving a data event E from worker u:
       - if the marker from u has already arrived, w buffers E
       - else w processes E normally
     ◮ on receiving a marker from worker u:
       - if markers from all upstream workers have arrived, w starts the checkpointing operation
     (See the sketch below.)
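
A sketch of the per-worker marker logic described on this slide, in the style of a Chandy-Lamport variant. The types, class name, and the post-checkpoint draining of buffered events are illustrative assumptions, not SAND's actual implementation.

```cpp
#include <queue>
#include <set>
#include <utility>

struct Event { int payload; };   // hypothetical event type
using WorkerId = int;

class Worker {
public:
    explicit Worker(std::set<WorkerId> upstreams)
        : upstreams_(std::move(upstreams)) {}

    void onDataEvent(WorkerId u, const Event& e) {
        if (markerArrived_.count(u))
            buffered_.push(e);   // u is already past the cut: hold back
        else
            process(e);          // still before the cut: process normally
    }

    void onMarker(WorkerId u) {
        markerArrived_.insert(u);
        if (markerArrived_ == upstreams_) {  // markers from all upstreams
            checkpointState();               // e.g., fork + write to HDFS
            markerArrived_.clear();
            while (!buffered_.empty()) {     // resume with held-back events
                process(buffered_.front());
                buffered_.pop();
            }
        }
    }

private:
    void process(const Event&) { /* update operator state, emit events */ }
    void checkpointState()     { /* see the fork-based sketch below */ }

    std::set<WorkerId> upstreams_;
    std::set<WorkerId> markerArrived_;
    std::queue<Event> buffered_;
};
```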

  22. Checkpointing Operation
     On each worker:
     • When a checkpoint starts, the worker creates child processes using fork
     • The parent processes then resume normal processing
     • The child processes write the internal state to HDFS, which performs replication for data reliability
     (A minimal sketch follows.)
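
A minimal sketch of the copy-on-write trick above: fork(), let the parent keep processing, and have the child serialize the frozen snapshot. The HDFS write is stubbed out with a local file, and saveToStableStorage is an illustrative name, not SAND's API.

```cpp
#include <cstdio>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

void saveToStableStorage(const std::string& state) {
    // In SAND this write goes to HDFS, which replicates the data;
    // a local file stands in for it here.
    FILE* f = fopen("/tmp/checkpoint.bin", "wb");
    if (f) { fwrite(state.data(), 1, state.size(), f); fclose(f); }
}

void checkpoint(const std::string& state) {
    pid_t pid = fork();
    if (pid == 0) {                  // child: sees a copy-on-write
        saveToStableStorage(state);  // snapshot of the parent's memory
        _exit(0);
    }
    // parent (pid > 0): resumes normal processing immediately; the
    // child can be reaped later with waitpid(pid, ..., WNOHANG).
}
```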

  23. Output Buffer
     Buffer output events for recovery:
     • Each worker records output data events in its output buffer, so that it can replay them during failure recovery
     • When global checkpoint c has finished, data recorded in output buffers before checkpoint c can be deleted (see the sketch below)
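
One way to realize this, sketched below under the assumption that each recorded event is tagged with the id of the last locally completed checkpoint: once global checkpoint c finishes, events tagged before c are safe to discard, and recovery to checkpoint c replays the rest. The epoch-tagging scheme and all names here are illustrative, not taken from SAND.

```cpp
#include <cstdint>
#include <deque>
#include <utility>

struct Event { int payload; };   // hypothetical event type

class OutputBuffer {
public:
    // `epoch` = id of the last checkpoint taken before emitting `e`.
    void record(uint64_t epoch, Event e) {
        buf_.emplace_back(epoch, std::move(e));
    }

    // Called when the coordinator reports global checkpoint `c` done:
    // events emitted before checkpoint c are no longer needed.
    void truncate(uint64_t c) {
        while (!buf_.empty() && buf_.front().first < c)
            buf_.pop_front();
    }

    // During recovery to checkpoint `c`, replay everything after it.
    template <typename Fn>
    void replayFrom(uint64_t c, Fn&& send) const {
        for (const auto& [epoch, e] : buf_)
            if (epoch >= c) send(e);
    }

private:
    std::deque<std::pair<uint64_t, Event>> buf_;  // ordered by epoch
};
```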

  24. Failure Recovery
     [Figure: worker dataflow graph illustrating the failed workers F, their downstream workers D_F, and the upstream workers P_F]
     • F: failed workers
     • D_F: downstream workers of F
     • F ∪ D_F: rolled back to the most recent checkpoint c
     • P_F: the upstream workers of F ∪ D_F
     • Workers in P_F replay their output events recorded after checkpoint c
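
For illustration, the two recovery sets can be computed from the dataflow graph with simple traversals. This is a hypothetical helper, assuming the graph is given as adjacency lists of worker ids; it is not SAND's code.

```cpp
#include <map>
#include <queue>
#include <set>
#include <vector>

using WorkerId = int;
using Graph = std::map<WorkerId, std::vector<WorkerId>>;  // edges u -> v

// D_F: all workers reachable downstream from the failed set F.
std::set<WorkerId> downstreamClosure(const Graph& g,
                                     const std::set<WorkerId>& F) {
    std::set<WorkerId> seen;
    std::queue<WorkerId> q;
    for (WorkerId f : F) q.push(f);
    while (!q.empty()) {
        WorkerId u = q.front(); q.pop();
        auto it = g.find(u);
        if (it == g.end()) continue;
        for (WorkerId v : it->second)
            if (seen.insert(v).second) q.push(v);
    }
    return seen;
}

// P_F: workers outside the rolled-back set (F ∪ D_F) with an edge
// into it; these must replay their buffered output events.
std::set<WorkerId> upstreamFrontier(const Graph& g,
                                    const std::set<WorkerId>& rolledBack) {
    std::set<WorkerId> P;
    for (const auto& [u, outs] : g)
        if (!rolledBack.count(u))
            for (WorkerId v : outs)
                if (rolledBack.count(v)) { P.insert(u); break; }
    return P;
}
```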

  25. Evaluation

  26. Experiment 1
     • Testbed: one quad-core machine with 4GB RAM
     • Dataset: packet-header trace; 331 million packets accounting for 143GB of traffic
     • Application: packet counter

     System     Packets/s   Payload Rate   Header Rate
     Storm      260K        840 Mb/s       81.15 Mb/s
     Blockmon   2.7M        8.4 Gb/s       844.9 Mb/s
     SAND       9.6M        31.4 Gb/s      3031.7 Mb/s

     • SAND achieves 3.7x and 37.4x the throughput of Blockmon [Simoncelli'13] and Storm, respectively

  27. Experiment 2
     • Testbed: three 16-core machines with 94GB RAM
     • Dataset: a 2-hour network trace (32GB) collected from a commercial GPRS core network in China in 2013
     • Application: AppTracker

  28. Experiment 2
     [Figure: throughput (Mb/s) vs. number of analyzers, for checkpointing intervals of 2s, 5s, and 10s, and with fault tolerance disabled]
     • Scales out by running parallel workers on multiple servers
     • Fault tolerance adds negligible overhead

  29. Experiment 3
     [Figure: throughput (Mb/s) over time (seconds) under worker failure, for checkpointing intervals of 5s and 10s; markers t1 ... t5 annotate the curves]
     • SAND recovers in a matter of seconds
     • Recovery time is proportional to the checkpointing interval

  30. Conclusion
     • Presented a new distributed stream processing system for network analytics
     • Proposed a novel checkpointing protocol that provides reliable fault tolerance for stream processing systems
     • SAND can operate at core-router traffic levels and can recover from failures in a matter of seconds

  31. Thank you! Q & A
