Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing


  1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
     M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I. Stoica
     Computer Laboratory

  2. Principal Motivation
     • MapReduce/Dryad are built around an acyclic flow of data
     • Inefficient at handling iterative computation and data reuse, e.g.:
       • machine learning algorithms
       • interactive data mining tools
     • Propose a solution for a class of applications that require:
       • working sets of data
       • scalability and fault tolerance

  3. Resilient Distributed Datasets
     Key Idea
     • Leverage distributed memory
     • Improve upon specialised frameworks, e.g. HaLoop, Pregel, etc.
     What are RDDs?
     • Read-only collections of objects
     • Partitioned across several nodes
     • Reconstructible in case of node failure
     • Enable in-memory computation

  4. Resilient Distributed Datasets
     Representation of RDDs
     • a set of partitions
     • a set of dependencies on parent RDDs (the lineage)
     • a function to compute the RDD from its parent RDDs
     • metadata on the partitioning scheme and data placement
     Lineage
     • to recompute the elements of a lost partition, iterate over the parent partitions and apply the RDD's compute function
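
     A minimal sketch of this representation in Scala. The trait and method names below are illustrative, loosely following the interface described in the paper rather than the exact Spark internals:

       // Illustrative sketch of the RDD representation; names are simplified,
       // not the real Spark internals.
       trait Partition { def index: Int }
       trait Dependency { def rdd: RDD[_] }

       abstract class RDD[T] {
         // the set of partitions (atomic pieces of the dataset)
         def partitions: Array[Partition]
         // the dependencies on parent RDDs, i.e. the lineage
         def dependencies: Seq[Dependency]
         // compute the elements of one partition from its parents' iterators
         def compute(split: Partition, parents: Seq[Iterator[_]]): Iterator[T]
         // metadata: preferred locations (data placement) and partitioning scheme
         def preferredLocations(split: Partition): Seq[String] = Nil
         def partitioner: Option[AnyRef] = None
       }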

  5. RDDs: Types of Dependencies
     Narrow Dependencies
     • one-to-one mapping of partitions between parent and child
     • allow pipelined execution on cluster nodes
     • e.g. the map operation
     Wide Dependencies
     • many-to-one mapping of parent partitions onto child partitions
     • require data from all parent partitions and a shuffle-like operation
     • e.g. the join operation
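
     A hedged illustration of the two kinds of dependencies, assuming a SparkContext named sc and hypothetical HDFS paths:

       // Hypothetical example; assumes a SparkContext named sc.
       val users  = sc.textFile("hdfs://namenode/data/users.txt")
                      .map(line => (line.split(",")(0), line))        // narrow: map
       val visits = sc.textFile("hdfs://namenode/data/visits.txt")
                      .map(line => (line.split(",")(0), line))        // narrow: map

       // Narrow dependency: each child partition reads exactly one parent
       // partition, so map/filter steps can be pipelined on the same node.
       val activeUsers = users.filter { case (_, rec) => rec.contains("active") }

       // Wide dependency: a child partition may need data from all parent
       // partitions, so the scheduler inserts a shuffle here.
       val joined = activeUsers.join(visits)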

  6. Resilient Distributed Datasets: Key Differences

     Aspect                       | RDDs                                         | Distributed Shared Memory
     Reads                        | coarse- or fine-grained                      | fine-grained
     Writes                       | coarse-grained; immutable (trivial consistency) | fine-grained
     Behaviour if not enough RAM  | similar to existing data-flow systems        | poor performance
     Fault recovery               | fine-grained and low-overhead, using lineage | requires checkpoints and rollbacks

  7. Resilient Distributed Datasets
     Computational Factors
     • cost of storage
     • disk I/O overhead
     • probability of node failure
     • cost of recomputing a partition
     Limitations
     • inefficient for asynchronous, fine-grained updates
     • e.g. an incremental web crawler, the storage system for a web application, etc.

  8. Spark: Cluster Computing Framework
     Introduction
     • implemented in Scala
     • built on top of Mesos (a cluster operating system)
     • enables resource sharing with Hadoop and MPI
     RDD implementation
     • HDFS files as RDD objects
     • partition-to-block mapping (one partition per HDFS block)

  9. Spark: RDD Representation
     Types of RDD constructs
     • from a file in a shared file system, e.g. HDFS
     • from a Scala collection object, e.g. an array
     • by transforming an existing RDD, e.g. using flatMap()
     • by changing the persistence of an existing RDD:
       • cache action: the dataset is kept in memory
       • save action: the dataset is written to the file system
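
     A short sketch of these four constructs using the public RDD API (the application name and HDFS paths are hypothetical, and the SparkConf-based setup reflects today's API rather than the 2012-era one):

       import org.apache.spark.{SparkConf, SparkContext}

       val sc = new SparkContext(new SparkConf().setAppName("rdd-constructs"))

       // 1. from a file in a shared file system such as HDFS
       val lines = sc.textFile("hdfs://namenode/data/input.txt")

       // 2. from a Scala collection in the driver program
       val nums = sc.parallelize(Array(1, 2, 3, 4))

       // 3. by transforming an existing RDD, here with flatMap()
       val words = lines.flatMap(line => line.split(" "))

       // 4. by changing persistence: cache keeps the dataset in memory,
       //    save writes it to the file system
       val cachedWords = words.cache()
       cachedWords.saveAsTextFile("hdfs://namenode/data/words-out")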

  10. Spark: Dataflow
      • the driver program implements the control flow
      • parallel programming abstractions:
        • RDDs
        • parallel operations
      • types of parallel operations:
        • reduce
        • collect
        • foreach
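
      A hedged illustration of the three parallel operations, run from the driver program (assumes the SparkContext sc from the earlier sketch):

        // reduce: combine elements with an associative function and return the result to the driver
        val lengths = sc.parallelize(Seq("spark", "rdd", "lineage")).map(_.length)
        val total = lengths.reduce(_ + _)

        // collect: bring all elements of the RDD back to the driver program
        val all = lengths.collect()

        // foreach: apply a side-effecting function to each element on the workers
        lengths.foreach(n => println(n))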

  11. Spark: Dataflow
      Job Scheduling
      • the RDD's lineage graph is examined when an action runs
      • a DAG of stages is built
      • characteristics of a stage:
        • contains as many pipelined narrow-dependency transformations as possible
        • stage boundaries fall at wide dependencies, which require a shuffle operation
      • tasks are assigned to nodes based on data locality
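
      A small, hypothetical job illustrating where stage boundaries fall (toDebugString prints the lineage, which mirrors the stage structure the scheduler builds):

        // hypothetical log-counting job
        val counts = sc.textFile("hdfs://namenode/logs/events.txt")   // narrow: stage 1
          .map(line => (line.split("\t")(0), 1))                      // narrow: still stage 1 (pipelined)
          .reduceByKey(_ + _)                                         // wide: shuffle, starts stage 2
          .filter { case (_, n) => n > 10 }                           // narrow: still stage 2

        // inspect the lineage / stage structure
        println(counts.toDebugString)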

  12. Spark: Limitations
      • failures of the scheduler itself are not tolerated
        • task failures are handled by re-running the task, as long as the stage's parents are still available
        • tolerating scheduler failure would require replicating the RDD lineage graph so partitions can still be computed
      • the checkpointing API is application/user dependent
        • requested via a REPLICATE flag to persist
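
      A sketch of user-driven checkpointing and replicated persistence in the present-day API (StorageLevel.MEMORY_ONLY_2 plays roughly the role of the REPLICATE flag; the path and the counts RDD come from the earlier hypothetical examples):

        import org.apache.spark.storage.StorageLevel

        // keep each partition in memory on two nodes
        val replicated = counts.persist(StorageLevel.MEMORY_ONLY_2)

        // explicit, user-requested checkpoint to stable storage,
        // which truncates long lineage chains
        sc.setCheckpointDir("hdfs://namenode/checkpoints")
        replicated.checkpoint()
        replicated.count()   // an action forces materialisation and the checkpoint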

  13. Spark: Assessment
      Datasets
      • user-written applications
      • ML algorithms: k-means and logistic regression
      • a 1 TB dataset for interactive queries
      Benchmarks
      • Hadoop: 0.20.2 stable release
      • HadoopBinMem:
        • converts the input data to a binary format
        • reduces overhead

  14. Spark: Assessment
      ML algorithms
      • Spark outperforms Hadoop by up to 20x
      • repeated I/O and deserialisation costs are avoided
      Interactive query dataset
      • Spark answered queries with response times of 5.5-7 s
      • dependent on the PageRank implementation
      User applications
      • analytics report execution improved by 40x
      • other applications scale and perform well

  15. RDDs: Conclusion
      • showed better performance than existing frameworks
      • can express a range of cluster programming models
      • capture the key optimisations:
        • keeping specific data in memory
        • partitioning data to minimise communication
        • recovering from failures efficiently
      • a promising paradigm for cluster computing
