Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing - PowerPoint PPT Presentation

SLIDE 1

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Authors: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, University of California, Berkeley

NSDI’12 Awarded Best Paper! Presented by Xiaofeng Wu, adapted from Matei’s NSDI’12 presentation and other resources

SLIDE 2

Problems

  • Current cluster computing frameworks
  • MapReduce
  • Dryad: distributed data-parallel programs from sequential building blocks
  • Pros: high-level operators; work distribution; fault tolerance
  • Problems when doing large-scale data analytics
  • lack abstractions for leveraging distributed memory
  • inefficient in reusing intermediate results across multiple computations
  • iterative machine learning; graph algorithms
  • e.g., PageRank, K-means clustering, and logistic regression
  • interactive data mining
  • e.g., multiple ad-hoc queries on the same subset of the data

SLIDE 3

Examples

[Figure: data sharing in MapReduce — each iteration (iter. 1, iter. 2, …) and each query (query 1, 2, 3, …) reads from and writes to HDFS, so intermediate results pass through stable storage]

Substantial overheads: data replication, disk I/O, and serialization

SLIDE 4

Existing solutions

  • Specialized frameworks
  • Pregel: a system for iterative graph computations that keeps intermediate data in memory
  • HaLoop: offers an iterative MapReduce interface
  • Problems
  • They do not provide abstractions for more general reuse
  • e.g., to let a user load several datasets into memory and run ad-hoc queries
  • Existing in-memory storage on clusters
  • e.g., distributed shared memory; key-value stores; databases; Piccolo
  • Problems for data-intensive workloads
  • they only provide a low-level programming interface: just reads and updates to table cells
  • they must copy large amounts of data over the cluster network to provide fault tolerance
  • they replicate too much data across machines or log updates across machines

Piccolo: a data-centric programming model for writing parallel in-memory applications in data centers.

SLIDE 5

Comparison of RDDs with distributed shared memory

SLIDE 6

Proposal

  • Resilient distributed datasets (RDDs)
  • Why is RDD better?
  • optimized data placement via controlling partitioning
  • let users explicitly persist intermediate results in memory
  • enables efficient data reuse in a broad range of applications
  • a rich set of operators: map, join, filter
  • fault-tolerant by logging the transformations used to build a dataset (lineage)
  • up to 20× faster than Hadoop for iterative applications; speeds up a real-world data analytics report by 40×; can be used interactively to scan a 1 TB dataset with 5–7 s latency

SLIDE 7

Applications Not Suitable for RDDs

  • Applications that make asynchronous fine-grained updates to shared state
  • e.g., a storage system for a web application or an incremental web crawler
  • Reason:
  • RDDs are best suited for batch applications that apply the same operation to all elements of a dataset

SLIDE 8

What is an RDD?

  • A distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
  • An RDD is a read-only, partitioned collection of records.
  • RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.
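The definition above can be mimicked in plain Scala with no Spark at all. This is only an illustrative sketch: `Partitioned` and `fromStable` are hypothetical names, not Spark API.

```scala
// Plain-Scala sketch of "read-only, partitioned collection of records".
// `Partitioned` and `fromStable` are illustrative names, not Spark API.
case class Partitioned[A](partitions: Vector[Vector[A]]) {
  // a deterministic operation deriving a new read-only dataset
  def map[B](f: A => B): Partitioned[B] =
    Partitioned(partitions.map(_.map(f)))
  def collect: Vector[A] = partitions.flatten
}

// (1) create from "stable storage" (modeled here as an in-memory sequence)
def fromStable[A](data: Seq[A], numParts: Int): Partitioned[A] =
  Partitioned(
    data.grouped(math.max(1, data.size / numParts)).map(_.toVector).toVector
  )

// (2) create from another dataset via a deterministic transformation
val base    = fromStable(1 to 8, 2)
val doubled = base.map(_ * 2)
```

Because every dataset is derived only by deterministic operations, any of its partitions can be recomputed from its inputs, which is the property the paper's fault-tolerance story rests on.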

SLIDE 9

[Figure: an RDD's partitions distributed across worker nodes, coordinated by a driver. Adopted from: http://www.cs.tau.ac.il/~milo/courses/ds16slides/OmriZ.pptx]

SLIDE 10

Question 1:

  • (1) “...individual RDDs are immutable...”
  • What does it mean for an RDD to be “immutable”?
  • What benefits does this property of RDD bring?

SLIDE 11

In-Memory Data Sharing

[Figure: in-memory data sharing — the input is read from HDFS once (one-time processing); iterations (iter. 1, iter. 2, …) and queries (query 1, 2, 3, …) then reuse the data in memory instead of repeating HDFS reads and writes]

SLIDE 12

Question 2:

  • (2) When an RDD is being created (new data are being written into it), can the data in the RDD be read for computing before the RDD is completely created?

SLIDE 13

RDD Recovery

[Figure: one-time processing of the input; queries (query 1, query 2, query 3, …) and iterations (iter. 1, iter. 2, …) share the resulting in-memory data]

SLIDE 14

Tradeoff Space

[Figure: tradeoff space — granularity of updates (fine ↔ coarse) vs. write throughput (low ↔ high). Fine-grained, low-throughput systems (K-V stores, databases, RAMCloud) are best for transactional workloads; coarse-grained, high-throughput systems (HDFS, RDDs) are best for batch workloads; memory bandwidth and network bandwidth bound the extremes]

SLIDE 15

Spark Programming Interface

  • DryadLINQ and FlumeJava-like API in the Scala language
  • Usable interactively from Scala interpreter
  • Provides:
  • Resilient distributed datasets (RDDs) represented as objects
  • Operations on RDDs:
  • transformations (build new RDDs)
  • actions (compute and output results)
  • Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, spill to disk, etc.)
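The transformation/action split can be sketched in plain Scala without Spark: transformations only build a description of the computation, and an action forces evaluation. The names here (`Dataset`, `Source`, `Mapped`) are illustrative, not Spark's internals.

```scala
// Sketch: lazy transformations vs. eager actions (illustrative names, no Spark).
sealed trait Dataset[A] {
  // transformation: returns a new graph node, computes nothing yet
  def map[B](f: A => B): Dataset[B] = Mapped(this, f)
  // action: walks the graph and produces results
  def compute: Vector[A]
}
final case class Source[A](data: Vector[A]) extends Dataset[A] {
  def compute: Vector[A] = data
}
final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def compute: Vector[B] = parent.compute.map(f)
}

val ds  = Source(Vector(1, 2, 3)).map(_ + 1) // no work done yet
val out = ds.compute                         // action triggers evaluation
```

Deferring work until an action runs is what lets Spark record lineage and plan execution before touching the data.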

SLIDE 16

Example: Log Mining

  • Problem Description
  • Suppose that a web service is experiencing errors and an operator wants to search terabytes of logs in the Hadoop filesystem (HDFS) to find the cause (lines tagged ERROR).

SLIDE 17

Transformations and actions available on RDDs in Spark

SLIDE 18

lines = spark.textFile("hdfs://...")          // transformation
errors = lines.filter(_.startsWith("ERROR"))  // transformation
errors.persist()                              // persistence
errors.count()                                // action

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split("\t")(3)).collect()

Figure: the lineage graph for the RDDs in the third query

Spark can rebuild it by applying a filter on only the corresponding partition of lines.

Question 5:

Explain Figure 1 about a lineage graph. 


SLIDE 19

Representing RDDs


Simple graph-based representation

Representation of each RDD

  • 1. a set of partitions, which are atomic pieces of the dataset.
  • 2. a set of dependencies on parent RDDs.
  • 3. a function for computing the dataset based on its parents.
  • 4. metadata about its partitioning scheme.
  • 5. metadata about data placement.
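The five components above can be sketched as a Scala trait. This is illustrative only; Spark's actual internal interface differs in its details.

```scala
// Sketch of the five-part RDD representation (not Spark's real interface).
trait RDDRepr[A] {
  def numPartitions: Int                              // 1. atomic pieces of the dataset
  def dependencies: Seq[RDDRepr[_]]                   // 2. parent RDDs
  def compute(partition: Int): Iterator[A]            // 3. derive a partition from parents
  def partitioner: Option[String] = None              // 4. partitioning-scheme metadata
  def preferredLocations(p: Int): Seq[String] = Nil   // 5. data-placement metadata
}

// a trivial source dataset implementing the interface
class SeqRDD[A](data: Vector[Vector[A]]) extends RDDRepr[A] {
  def numPartitions: Int = data.size
  def dependencies: Seq[RDDRepr[_]] = Nil
  def compute(partition: Int): Iterator[A] = data(partition).iterator
}

val r = new SeqRDD(Vector(Vector(1, 2), Vector(3, 4)))
```

Because `compute` derives one partition from the parents, the scheduler can rebuild exactly the partitions that were lost.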
SLIDE 20

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))

Fault Recovery

[Figure: the lineage chain HadoopRDD → FilteredRDD → MappedRDD, replayed to recover lost partitions]
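Fault recovery by lineage can be sketched in plain Scala: instead of logging the derived data, we log the two transformation steps and replay them over the surviving source partition. The sample log lines below are made up for illustration.

```scala
// Sketch: rebuilding a lost partition by replaying its lineage (plain Scala).
// The steps mirror the filter/map chain above; the sample data is made up.
val lineage: List[Vector[String] => Vector[String]] = List(
  _.filter(_.contains("error")),   // the FilteredRDD step
  _.map(_.split('\t')(2))          // the MappedRDD step (field 2)
)
val sourcePartition = Vector("info\tx\ta", "error\ty\tb", "error\tz\tc")

// a worker holding the derived partition fails; recompute it from the source:
val rebuilt = lineage.foldLeft(sourcePartition)((data, step) => step(data))
```

Note that only the two small function descriptions had to survive the failure, not the derived records themselves.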

SLIDE 21

Question 3:

  • (3) “This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.”
  • “To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.”
  • Why does using RDD help to provide efficient fault tolerance? Or why does coarse-grained transformation help with the efficiency?
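One way to see the efficiency argument is a deliberately contrived plain-Scala sketch of log sizes: lineage records a single transformation description for a whole dataset, while fine-grained fault tolerance must log (or replicate) an entry per updated element.

```scala
// Contrived sketch: recovery-log size, coarse-grained vs. fine-grained.
val n = 1000
val data = Vector.tabulate(n)(identity)

// coarse-grained: one lineage entry describes how to rebuild all n elements
val lineageLog = List("map(_ * 2)")

// fine-grained: one log entry (or replicated write) per updated cell
val updateLog = data.map(i => s"set($i, ${i * 2})")
```

The lineage log stays constant-size as the dataset grows, which is why logging transformations is cheap enough to do for every RDD.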


SLIDE 22

Question 4:

  • (4) “In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations.”
  • What’s the consequence if a user does not explicitly request persistence of an RDD?


SLIDE 23

Example: PageRank

SLIDE 24

Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
SLIDE 25

Optimizing Placement

links & ranks repeatedly joined

Can co-partition them (e.g., hash both on URL) to avoid shuffles. Can also use app knowledge, e.g., hash on DNS name.

links = links.partitionBy(new URLPartitioner())

[Figure: iteration dataflow — Links (url, neighbors) is joined with Ranks0 (url, rank) to produce Contribs0, reduced to Ranks1, joined again for Contribs2, …, yielding Ranks2, …]

SLIDE 26

PageRank Performance

[Figure: time per iteration (s) — Hadoop: 171, Basic Spark: 72, Spark + Controlled Partitioning: 23]

SLIDE 27

Conclusion

  • Efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications
  • Coarse-grained transformations that let them recover data efficiently using lineage
  • Expressive for a wide range of parallel applications like iterative computation
