Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - PowerPoint PPT Presentation

SLIDE 1

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Authors: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, University of California, Berkeley

NSDI’12 Awarded Best Paper! Presented by Xiaofeng Wu, adapted from Matei’s NSDI’12 presentation and other resources

SLIDE 2

Problems

  • Current cluster computing frameworks
  • MapReduce
  • Dryad: distributed data-parallel programs from sequential building blocks
  • Pros: high-level operators; work distribution; fault tolerance
  • Problems when doing large-scale data analytics
  • lack abstractions for leveraging distributed memory
  • inefficient in reusing intermediate results across multiple computations
  • iterative machine learning; graph algorithms
  • e.g., PageRank, K-means clustering, and logistic regression
  • interactive data mining
  • e.g., multiple ad-hoc queries on the same subset of the data

SLIDE 3

Examples

[Figure: data sharing in MapReduce — each iteration (iter. 1, iter. 2, …) and each query (query 1, 2, 3, …) reads from and writes to HDFS, so intermediate results pass through stable storage]

Substantial overheads: data replication, disk I/O, and serialization

SLIDE 4

Existing solutions

  • Specialized frameworks
  • Pregel: a system for iterative graph computations that keeps intermediate data in memory
  • HaLoop: offers an iterative MapReduce interface
  • Problems
  • They do not provide abstractions for more general reuse
  • e.g., to let a user load several datasets into memory and run ad-hoc queries
  • Existing in-memory storage on clusters
  • e.g., distributed shared memory; key-value stores; databases; Piccolo
  • Problems for data-intensive workloads
  • they only provide a low-level programming interface: just reads and updates to table cells
  • they must copy large amounts of data over the cluster network to provide fault tolerance
  • they replicate too much data across machines or log updates across machines

Piccolo: a data-centric programming model for writing parallel in-memory applications in data centers.

SLIDE 5

Comparison of RDDs with distributed shared memory

SLIDE 6

Proposal

  • Resilient distributed datasets (RDDs)
  • Why is RDD better?
  • optimized data placement via controlling partitioning
  • let users explicitly persist intermediate results in memory
  • enables efficient data reuse in a broad range of applications
  • a rich set of operators: map, join, filter
  • fault-tolerant by logging the transformations used to build a dataset (lineage)
  • up to 20× faster than Hadoop for iterative applications; speeds up a real-world data analytics report by 40×; can be used interactively to scan a 1 TB dataset with 5–7 s latency

SLIDE 7

Applications Not Suitable for RDDs

  • Applications that make asynchronous fine-grained updates to shared state
  • e.g., a storage system for a web application or an incremental web crawler
  • Reason:
  • RDDs are best suited for batch applications that apply the same operation to all elements of a dataset

SLIDE 8

What is an RDD?

  • A distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
  • An RDD is a read-only, partitioned collection of records.
  • RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.
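The definition above can be mimicked in plain Scala with no Spark at all. This is only an illustrative sketch: `Partitioned` and `fromStable` are hypothetical names, not Spark API.

```scala
// Plain-Scala sketch of "read-only, partitioned collection of records".
// `Partitioned` and `fromStable` are illustrative names, not Spark API.
case class Partitioned[A](partitions: Vector[Vector[A]]) {
  // a deterministic operation deriving a new read-only dataset
  def map[B](f: A => B): Partitioned[B] =
    Partitioned(partitions.map(_.map(f)))
  def collect: Vector[A] = partitions.flatten
}

// (1) create from "stable storage" (modeled here as an in-memory sequence)
def fromStable[A](data: Seq[A], numParts: Int): Partitioned[A] =
  Partitioned(
    data.grouped(math.max(1, data.size / numParts)).map(_.toVector).toVector
  )

// (2) create from another dataset via a deterministic transformation
val base    = fromStable(1 to 8, 2)
val doubled = base.map(_ * 2)
```

Because every dataset is derived only by deterministic operations, any of its partitions can be recomputed from its inputs, which is the property the paper's fault-tolerance story rests on.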

SLIDE 9

[Figure: an RDD's partitions distributed across worker nodes, coordinated by a driver. Adopted from: http://www.cs.tau.ac.il/~milo/courses/ds16slides/OmriZ.pptx]

SLIDE 10

Question 1:

  • (1) “...individual RDDs are immutable...”
  • What does it mean for an RDD to be “immutable”?
  • What benefits does this property of RDD bring?

SLIDE 11

In-Memory Data Sharing

[Figure: in-memory data sharing — the input is read from HDFS once (one-time processing); iterations (iter. 1, iter. 2, …) and queries (query 1, 2, 3, …) then reuse the data in memory instead of repeating HDFS reads and writes]

SLIDE 12

Question 2:

  • (2) When an RDD is being created (new data are being written into it), can the data in the RDD be read for computing before the RDD is completely created?

SLIDE 13

RDD Recovery

[Figure: one-time processing of the input; queries (query 1, query 2, query 3, …) and iterations (iter. 1, iter. 2, …) share the resulting in-memory data]

SLIDE 14

Tradeoff Space

[Figure: tradeoff space — granularity of updates (fine ↔ coarse) vs. write throughput (low ↔ high). Fine-grained, low-throughput systems (K-V stores, databases, RAMCloud) are best for transactional workloads; coarse-grained, high-throughput systems (HDFS, RDDs) are best for batch workloads; memory bandwidth and network bandwidth bound the extremes]

SLIDE 15

Spark Programming Interface

  • DryadLINQ and FlumeJava-like API in the Scala language
  • Usable interactively from Scala interpreter
  • Provides:
  • Resilient distributed datasets (RDDs) represented as objects
  • Operations on RDDs:
  • transformations (build new RDDs)
  • actions (compute and output results)
  • Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, spill to disk, etc.)
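The transformation/action split can be sketched in plain Scala without Spark: transformations only build a description of the computation, and an action forces evaluation. The names here (`Dataset`, `Source`, `Mapped`) are illustrative, not Spark's internals.

```scala
// Sketch: lazy transformations vs. eager actions (illustrative names, no Spark).
sealed trait Dataset[A] {
  // transformation: returns a new graph node, computes nothing yet
  def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  // action: walks the graph and produces results
  def compute: Vector[A]
}
final case class Source[A](data: Vector[A]) extends Dataset[A] {
  def compute: Vector[A] = data
}
final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def compute: Vector[B] = parent.compute.map(f)
}

val ds  = Source(Vector(1, 2, 3)).map(_ + 1) // no work done yet
val out = ds.compute                         // action triggers evaluation
```

Deferring work until an action runs is what lets Spark record lineage and plan execution before touching the data.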

SLIDE 16

Example: Log Mining

  • Problem Description
  • Suppose that a web service is experiencing errors and an operator wants to search terabytes of logs in the Hadoop filesystem (HDFS) to find the cause (lines tagged ERROR).

SLIDE 17

Transformations and actions available on RDDs in Spark

SLIDE 18

lines = spark.textFile("hdfs://...")          // transformation
errors = lines.filter(_.startsWith("ERROR"))  // transformation
errors.persist()                              // persistence
errors.count()                                // action

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split("\t")(3)).collect()

Figure: the lineage graph for the RDDs in the third query

Spark can rebuild it by applying a filter on only the corresponding partition of lines.

Question 5:

Explain Figure 1 about a lineage graph. 


SLIDE 19

Representing RDDs


Simple graph-based representation

Representation of each RDD

  • 1. a set of partitions, which are atomic pieces of the dataset.
  • 2. a set of dependencies on parent RDDs.
  • 3. a function for computing the dataset based on its parents.
  • 4. metadata about its partitioning scheme.
  • 5. metadata about data placement.
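The five components above can be sketched as a Scala trait. This is illustrative only; Spark's actual internal interface differs in its details.

```scala
// Sketch of the five-part RDD representation (not Spark's real interface).
trait RDDRepr[A] {
  def numPartitions: Int                              // 1. atomic pieces of the dataset
  def dependencies: Seq[RDDRepr[_]]                   // 2. parent RDDs
  def compute(partition: Int): Iterator[A]            // 3. derive a partition from parents
  def partitioner: Option[String] = None              // 4. partitioning-scheme metadata
  def preferredLocations(p: Int): Seq[String] = Nil   // 5. data-placement metadata
}

// a trivial source dataset implementing the interface
class SeqRDD[A](data: Vector[Vector[A]]) extends RDDRepr[A] {
  def numPartitions: Int = data.size
  def dependencies: Seq[RDDRepr[_]] = Nil
  def compute(partition: Int): Iterator[A] = data(partition).iterator
}

val r = new SeqRDD(Vector(Vector(1, 2), Vector(3, 4)))
```

Because `compute` derives one partition from the parents, the scheduler can rebuild exactly the partitions that were lost.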
SLIDE 20

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))

Fault Recovery

[Figure: the lineage chain HadoopRDD → FilteredRDD → MappedRDD, replayed to recover lost partitions]
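Fault recovery by lineage can be sketched in plain Scala: instead of logging the derived data, we log the two transformation steps and replay them over the surviving source partition. The sample log lines below are made up for illustration.

```scala
// Sketch: rebuilding a lost partition by replaying its lineage (plain Scala).
// The steps mirror the filter/map chain above; the sample data is made up.
val lineage: List[Vector[String] => Vector[String]] = List(
  _.filter(_.contains("error")),   // the FilteredRDD step
  _.map(_.split('\t')(2))          // the MappedRDD step (field 2)
)
val sourcePartition = Vector("info\tx\ta", "error\ty\tb", "error\tz\tc")

// a worker holding the derived partition fails; recompute it from the source:
val rebuilt = lineage.foldLeft(sourcePartition)((data, step) => step(data))
```

Note that only the two small function descriptions had to survive the failure, not the derived records themselves.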

SLIDE 21

Question 3:

  • (3) “This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.”
  • “To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.”
  • Why does using RDD help to provide efficient fault tolerance? Or why does coarse-grained transformation help with the efficiency?
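One way to see the efficiency argument is a deliberately contrived plain-Scala sketch of log sizes: lineage records a single transformation description for a whole dataset, while fine-grained fault tolerance must log (or replicate) an entry per updated element.

```scala
// Contrived sketch: recovery-log size, coarse-grained vs. fine-grained.
val n = 1000
val data = Vector.tabulate(n)(identity)

// coarse-grained: one lineage entry describes how to rebuild all n elements
val lineageLog = List("map(_ * 2)")

// fine-grained: one log entry (or replicated write) per updated cell
val updateLog = data.map(i => s"set($i, ${i * 2})")
```

The lineage log stays constant-size as the dataset grows, which is why logging transformations is cheap enough to do for every RDD.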


SLIDE 22

Question 4:

  • (4) “In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations.”
  • What’s the consequence if a user does not explicitly request persistence of an RDD?


SLIDE 23

Example: PageRank

SLIDE 24

Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
SLIDE 25

Optimizing Placement

links & ranks repeatedly joined

Can co-partition them (e.g., hash both on URL) to avoid shuffles. Can also use app knowledge, e.g., hash on DNS name.

links = links.partitionBy(new URLPartitioner())

[Figure: iteration dataflow — Links (url, neighbors) is joined with Ranks0 (url, rank) to produce Contribs0, reduced to Ranks1, joined again for Contribs2, …, yielding Ranks2, …]

SLIDE 26

PageRank Performance

[Figure: time per iteration (s) — Hadoop: 171, Basic Spark: 72, Spark + Controlled Partitioning: 23]

SLIDE 27

Conclusion

  • Efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications
  • Coarse-grained transformations that let them recover data efficiently using lineage
  • Expressive for a wide range of parallel applications like iterative computation
