SLIDE 1 Fully Fault Tolerant Real Time Data Pipeline with Docker and Mesos
Rahul Kumar
Technical Lead
LinuxCon / ContainerCon - Berlin, Germany
SLIDE 2 Agenda
- Data Pipeline
- Mesos + Docker
- Reactive Data Pipeline
SLIDE 3
Goal
Analyzing data always brings great benefits, and it is also one of the greatest challenges for an organization.
SLIDE 4
Today’s businesses generate massive amounts of digital data.
SLIDE 5
which is cumbersome to store, transport, and analyze
SLIDE 6 Building distributed systems on
commodity clusters is one of the better approaches to solving the data problem
SLIDE 7
SLIDE 8 Characteristics of a Distributed System
❏ Resource Sharing
❏ Openness
❏ Concurrency
❏ Scalability
❏ Fault Tolerance
❏ Transparency
SLIDE 9
Collect Store Process Analyze
SLIDE 10
Data Center
SLIDE 11
Manually Scale Frameworks & Install services
SLIDE 12
Complex Very Limited Inefficient Low Utilization
SLIDE 13
Static Partitioning: a blocker for a fault-tolerant data pipeline
SLIDE 14
Failures make it even more complex to manage
SLIDE 15
Apache Mesos
“Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.”
SLIDE 16
SLIDE 17 Mesos Features
- Scalability: scales up to 10,000s of nodes
- Fault tolerance: replicated masters and slaves using ZooKeeper
- Docker support: support for Docker containers
- Native containers: native isolation between tasks using Linux Containers
- Scheduling: multi-resource scheduling (memory, CPU, disk, and ports)
- API support: Java, Python, and C++ APIs for developing new parallel applications
- Monitoring: web UI for viewing cluster state
SLIDE 18
SLIDE 19
Resource Isolation
SLIDE 20
SLIDE 21
SLIDE 22
Docker Containerizer
Mesos adds support for launching tasks that contain Docker images. Users can launch a Docker image either as a Task or as an Executor. To enable the Docker Containerizer, run the mesos-agent with “docker” listed among its containerizers:
mesos-agent --containerizers=docker,mesos
SLIDE 23
SLIDE 24 Mesos Frameworks
- Aurora: developed at Twitter and later migrated to an Apache project. Aurora is a framework that keeps services running across a shared pool of machines and is responsible for keeping them running forever.
- Marathon: a container-orchestration framework for Mesos. Marathon helps run other frameworks on Mesos, and also runs application containers such as Jetty, JBoss Server, and Play Server.
- Chronos: a fault-tolerant job scheduler for Mesos, developed at Airbnb as a replacement for cron.
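To illustrate how Marathon keeps a Docker container running on Mesos, a minimal app definition might look like the following sketch (the app id, image name, and resource numbers are hypothetical; the field names follow the Marathon v2 REST API):

```json
{
  "id": "/my-streaming-app",
  "cpus": 1,
  "mem": 512,
  "instances": 2,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "sigmoid/streaming-worker:latest",
      "network": "BRIDGE"
    }
  }
}
```

POSTed to Marathon's /v2/apps endpoint, this asks Marathon to keep two instances of the container running, restarting them if they fail.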
SLIDE 25 Resilient Distributed Datasets (RDDs)
An RDD is:
- Immutable
- Distributed
- Lazily evaluated
- Type Inferred
- Cacheable
Spark Stack
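A minimal sketch of these properties in Spark's Scala API (assuming an already-created SparkContext `sc`; this is illustrative, not the speaker's code):

```scala
// Transformations are lazy and return new, immutable RDDs.
val nums    = sc.parallelize(1 to 100)   // distributed dataset
val doubled = nums.map(_ * 2)            // lazy: nothing runs yet; type inferred as RDD[Int]
val cached  = doubled.cache()            // cacheable: keep in memory after first computation
val total   = cached.reduce(_ + _)       // action: triggers the actual distributed computation
```

Note that `nums` is never modified; each transformation produces a new RDD, and work only happens when the action (`reduce`) is called.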
SLIDE 26 Many big-data applications need to process large data streams in near-real time
Monitoring Systems Alert Systems Computing Systems
Why Spark Streaming?
SLIDE 27 Taken from Apache Spark.
What is Spark Streaming?
SLIDE 28
➔ Framework for large-scale stream processing
➔ Created at UC Berkeley
➔ Scales to 100s of nodes
➔ Can achieve second-scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.
What is Spark Streaming?
SLIDE 29 Run a streaming computation as a series of very small, deterministic batch jobs
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as RDDs and processes them using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
Spark Streaming
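The micro-batch idea above can be sketched in plain Scala (no Spark required; ordinary collections stand in for DStream batches and RDD operations):

```scala
object MicroBatchSketch extends App {
  // Simulated live stream of events.
  val liveStream: Iterator[Int] = (1 to 10).iterator

  // Chop the stream into small batches ("batches of X seconds").
  val batches: Iterator[Seq[Int]] = liveStream.grouped(3)

  // Process each batch with ordinary operations, as Spark would
  // with RDD operations, and emit results batch by batch.
  val results: List[Int] = batches.map(batch => batch.map(_ * 2).sum).toList

  println(results)  // List(12, 30, 48, 20)
}
```

Spark Streaming does the same thing at scale: each batch becomes an RDD, and the per-batch computation is distributed across the cluster.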
SLIDE 30 Point of Failure
Simple Streaming Pipeline
SLIDE 31
SLIDE 32
- To use Mesos from Spark, you need a Spark binary package available in a
place accessible to Mesos (HTTP/S3/HDFS), and a Spark driver program configured to connect to Mesos.
- Configuring the driver program to connect to Mesos:
val sconf = new SparkConf()
  .setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos")
  .setAppName("MyStreamingApp")
  .set("spark.executor.uri", "hdfs://Sigmoid/executors/spark-1.3.0-bin-hadoop2.4.tgz")
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "30")
  .set("spark.executor.memory", "10g")
val sc = new SparkContext(sconf)
val ssc = new StreamingContext(sc, Seconds(1))
...
Spark Streaming over a HA Mesos Cluster
SLIDE 33 Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system.
- Spark and its RDD abstraction are designed to seamlessly handle failures of any worker node in
the cluster.
- In Streaming, driver failure can be recovered with checkpointing application state.
- Write Ahead Logs (WAL) & acknowledgements can ensure zero data loss.
Spark Streaming Fault-tolerance
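A sketch of driver recovery with checkpointing and the WAL, using `StreamingContext.getOrCreate` and the WAL property from the Spark Streaming API (the checkpoint path is hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoints/myStreamingApp" // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("MyStreamingApp")
    // WAL: persist received data to reliable storage before processing.
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)   // periodically save application state
  // ... DStream setup goes here ...
  ssc
}

// On a fresh start this creates a new context; after a driver failure it
// rebuilds the context and state from the checkpoint directory.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```

On a Mesos cluster, the driver itself can be supervised (e.g., by Marathon) so a failed driver is restarted and picks up from the checkpoint.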
SLIDE 34
Simple Fault-tolerant Streaming Infra
SLIDE 35
SLIDE 36
- Figure out the bottleneck : CPU, Memory, IO, Network
- If parsing is involved, use a parser that gives
high performance.
- Proper Data modeling
- Compression, Serialization
Creating a scalable pipeline
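For the compression and serialization points, a sketch of the relevant Spark settings (property names are from the Spark configuration documentation; the values shown are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("TunedPipeline")
  // Kryo is typically much faster and more compact than Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Compress serialized RDD partitions to trade CPU for IO and network.
  .set("spark.rdd.compress", "true")
  .set("spark.io.compression.codec", "lz4")
```

As the slide suggests, measure first: these settings help when the bottleneck is IO or network, and matter less when the job is CPU-bound.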
SLIDE 37 Thank You
@rahul_kumar_aws
LinuxCon / ContainerCon - Berlin, Germany