Resilient Distributed Datasets Presented by Henggang Cui 15799b - PowerPoint PPT Presentation

Resilient Distributed Datasets
Presented by Henggang Cui
15799b Talk

Why not MapReduce
• Provides fault tolerance, but:
• Hard to reuse intermediate results across multiple computations
  – needs stable storage to share data across jobs
• Hard to support interactive ad-hoc queries

Why not Other In-Memory Storage
• Example: Piccolo
  – applies fine-grained updates to shared state
• Efficient, but:
• Hard to provide fault tolerance
  – needs replication or checkpointing

Resilient Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
  – read-only, partitioned collection of records
  – can only be built through coarse-grained deterministic transformations of either
    • data in stable storage, or
    • other RDDs
• Express computation by
  – defining RDDs

Fault Recovery
• Efficient fault recovery using lineage
  – log the one operation that is applied to many elements (the lineage) rather than the data itself
  – recompute lost partitions on failure
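The recovery idea can be sketched in plain Scala without Spark. This is a toy model under stated assumptions, not Spark's implementation: `Source`, `Mapped`, `Filtered` and `recover` are hypothetical names, partitions are in-memory vectors, and a "lost" partition is simply one that is missing from a cache map.

```scala
// Toy model of lineage-based recovery. All names here are hypothetical.
object LineageSketch {
  // A dataset is either loaded from stable storage or derived from a parent
  // by one coarse-grained operation applied to every record of a partition.
  sealed trait Dataset {
    def numPartitions: Int
    // Rebuild one partition purely from the recorded lineage.
    def compute(partition: Int): Vector[Int]
  }

  final case class Source(stable: Vector[Vector[Int]]) extends Dataset {
    def numPartitions: Int = stable.length
    def compute(partition: Int): Vector[Int] = stable(partition)
  }

  final case class Mapped(parent: Dataset, f: Int => Int) extends Dataset {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[Int] = parent.compute(partition).map(f)
  }

  final case class Filtered(parent: Dataset, p: Int => Boolean) extends Dataset {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[Int] = parent.compute(partition).filter(p)
  }

  // Serve a partition from the cache if it survived, otherwise recompute
  // only that partition from its lineage.
  def recover(ds: Dataset, cache: Map[Int, Vector[Int]], partition: Int): Vector[Int] =
    cache.getOrElse(partition, ds.compute(partition))
}
```

Only the missing partition is recomputed; partitions that are still cached are returned untouched, which is what keeps the logging overhead low compared with replicating the data.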

Example
    val lines = spark.textFile("hdfs://...")                  // base RDD from stable storage
    val errors = lines.filter(_.startsWith("ERROR"))          // derived RDD
    val hdfs_errors = errors.filter(_.contains("HDFS"))       // derived from errors

Advantages of the RDD Model
• Efficient fault recovery
  – fine-grained and low-overhead, using lineage
• Immutability helps mitigate stragglers
  – slow tasks can be re-run as backup tasks
• Graceful degradation when RAM is not enough

Spark
• Implementation of the RDD abstraction
  – Scala interface
• Two components
  – Driver
  – Workers

Spark Runtime
• Driver
  – defines RDDs and invokes actions on them
  – tracks the RDDs' lineage
• Workers
  – store RDD partitions
  – perform RDD transformations

Supported RDD Operations
• Transformations
  – map(f: T => U)
  – filter(f: T => Bool)
  – join()
  – ... (and lots of others)
• Actions
  – count()
  – save()
  – ... (and lots of others)
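The split between transformations and actions can be illustrated with a small stand-in type. This is a sketch, assuming the usual Spark behaviour that transformations are lazy and only an action triggers computation; `LazyData` and `fromSeq` are hypothetical names and nothing here is distributed.

```scala
// Sketch of lazy transformations versus eager actions. Hypothetical names.
object LazyOpsSketch {
  // Holds a recipe for producing the data, not the data itself.
  final class LazyData[T](val compute: () => Vector[T]) {
    // Transformations: return a new recipe, run nothing.
    def map[U](f: T => U): LazyData[U] = new LazyData[U](() => compute().map(f))
    def filter(p: T => Boolean): LazyData[T] = new LazyData[T](() => compute().filter(p))

    // Actions: run the whole recipe and return a value to the caller.
    def count(): Long = compute().length.toLong
    def collect(): Vector[T] = compute()
  }

  def fromSeq[T](data: Seq[T]): LazyData[T] = new LazyData[T](() => data.toVector)
}
```

Chaining `map` and `filter` only composes functions; the work happens once, when `count()` or `collect()` is called.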

Representing RDDs
• A graph-based representation for RDDs
• Pieces of information for each RDD
  – a set of partitions
  – a set of dependencies on parent RDDs
  – a function for computing it from its parents
  – metadata about its partitioning scheme and data placement
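The four pieces of information listed above map naturally onto a small interface. The trait below is an illustrative sketch only: `SketchRdd`, `BlockRdd`, `FilteredRdd` and the node names are assumptions made for this example and are not Spark's actual classes.

```scala
// Sketch of a per-RDD interface mirroring the four items on the slide.
// All names are hypothetical.
object RddInterfaceSketch {
  final case class Partition(index: Int)

  trait SketchRdd[T] {
    def partitions: Seq[Partition]                          // a set of partitions
    def dependencies: Seq[SketchRdd[_]]                     // dependencies on parent RDDs
    def compute(p: Partition): Iterator[T]                  // how to compute it from its parents
    def partitioner: Option[String] = None                  // partitioning scheme metadata
    def preferredLocations(p: Partition): Seq[String] = Nil // data placement metadata
  }

  // Base RDD: each block is (node holding the block, lines in the block).
  final class BlockRdd(blocks: Vector[(String, Vector[String])]) extends SketchRdd[String] {
    def partitions: Seq[Partition] = blocks.indices.map(i => Partition(i))
    def dependencies: Seq[SketchRdd[_]] = Nil
    def compute(p: Partition): Iterator[String] = blocks(p.index)._2.iterator
    override def preferredLocations(p: Partition): Seq[String] = Seq(blocks(p.index)._1)
  }

  // Derived RDD: same partitions and placement as its parent, filtered records.
  final class FilteredRdd[T](parent: SketchRdd[T], pred: T => Boolean) extends SketchRdd[T] {
    def partitions: Seq[Partition] = parent.partitions
    def dependencies: Seq[SketchRdd[_]] = Seq(parent)
    def compute(p: Partition): Iterator[T] = parent.compute(p).filter(pred)
    override def preferredLocations(p: Partition): Seq[String] = parent.preferredLocations(p)
  }
}
```

Following `dependencies` from any RDD back to its base RDDs yields the lineage graph that the driver tracks.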

RDD Dependencies
• Narrow dependencies
  – each partition of the parent RDD is used by at most one partition of the child RDD
• Wide dependencies
  – each partition of the parent RDD may be used by multiple child partitions

RDD Dependencies
• Narrow dependencies
  – allow pipelined execution on one cluster node
  – easy fault recovery
• Wide dependencies
  – require data from all parent partitions to be available and shuffled across the nodes
  – a single failed node might cause a complete re-execution
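The recovery cost of each dependency type can be made concrete by asking which parent partitions are needed to rebuild one lost child partition. This is a sketch with hypothetical names (`OneToOne`, `Shuffle`, `parents`); it models only the one-to-one narrow case and a full shuffle.

```scala
// Sketch: which parent partitions does one child partition depend on?
// Hypothetical names, not Spark's dependency classes.
object DependencySketch {
  sealed trait Dependency {
    def parents(childPartition: Int): List[Int]
  }

  // Narrow (as in map or filter): child partition i reads only parent partition i.
  case object OneToOne extends Dependency {
    def parents(childPartition: Int): List[Int] = List(childPartition)
  }

  // Wide (as in a shuffle): every child partition may read from every parent partition.
  final case class Shuffle(numParentPartitions: Int) extends Dependency {
    def parents(childPartition: Int): List[Int] = (0 until numParentPartitions).toList
  }
}
```

With a narrow dependency, losing one child partition means recomputing one parent partition; with a wide dependency, the same loss touches all of them, which is why a single failed node can force a complete re-execution.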

Job Scheduling
• To execute an action on an RDD
  – the scheduler determines the stages from the RDD's lineage graph
  – each stage contains as many pipelined transformations with narrow dependencies as possible
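Stage construction can be sketched for the simplest case, a linear chain of operations. This is a deliberate simplification and an assumption of the example: a real lineage is a graph rather than a list, and here an operation with a wide dependency on its input simply opens a new stage. `Op` and `stages` are hypothetical names.

```scala
// Sketch: cut a linear chain of operations into stages at wide dependencies.
// Hypothetical names; a real lineage graph is a DAG, not a list.
object StageSketch {
  // wide = true means this operation has a wide dependency on its input.
  final case class Op(name: String, wide: Boolean)

  def stages(chain: List[Op]): List[List[String]] =
    chain.foldLeft(List.empty[List[String]]) { (acc, op) =>
      acc match {
        // Narrow dependency: pipeline the operation into the current stage.
        case current :: done if !op.wide => (op.name :: current) :: done
        // Wide dependency (or first operation): open a new stage.
        case _ => List(op.name) :: acc
      }
    }.map(_.reverse).reverse
}
```

For the chain `textFile`, `map`, `groupByKey`, `filter`, `join`, with `groupByKey` and `join` marked wide, this yields three stages: the narrow operations are grouped with the stage they follow.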

Memory Management
• Three options for persistent RDDs
  – in-memory storage as deserialized Java objects
  – in-memory storage as serialized data
  – on-disk storage
• LRU eviction policy at the level of RDDs
  – when there's not enough memory, evict a partition from the least recently accessed RDD
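The RDD-level LRU policy can be sketched with a small cache that counts partitions instead of bytes. This is an illustration under stated assumptions: `RddCache`, `put` and `get` are hypothetical names, capacity is a partition count, and the RDD being written counts as the most recently accessed one.

```scala
import scala.collection.mutable

// Sketch of LRU eviction at the level of RDDs. Hypothetical names.
object LruSketch {
  final class RddCache(capacity: Int) {
    require(capacity > 0, "capacity must be positive")

    // rddId -> cached partition indices. LinkedHashMap iterates in insertion
    // order, so removing and re-inserting a key marks it as most recent.
    private val byRdd = mutable.LinkedHashMap.empty[Int, mutable.Queue[Int]]
    private var size = 0

    private def touch(rddId: Int): mutable.Queue[Int] = {
      val parts = byRdd.remove(rddId).getOrElse(mutable.Queue.empty[Int])
      byRdd.put(rddId, parts)
      parts
    }

    // Cache a partition; returns the (rddId, partition) evicted to make room, if any.
    def put(rddId: Int, partition: Int): Option[(Int, Int)] = {
      val mine = touch(rddId)
      val evicted =
        if (size < capacity) None
        else {
          val (victimId, parts) = byRdd.head // least recently accessed RDD
          val p = parts.dequeue()
          if (parts.isEmpty && victimId != rddId) byRdd.remove(victimId)
          size -= 1
          Some((victimId, p))
        }
      mine.enqueue(partition)
      size += 1
      evicted
    }

    // Look up a partition; a hit counts as an access to its RDD.
    def get(rddId: Int, partition: Int): Boolean = {
      val hit = byRdd.get(rddId).exists(_.contains(partition))
      if (hit) touch(rddId)
      hit
    }
  }
}
```

Tracking recency per RDD rather than per partition means that reading any partition of an RDD protects all of its partitions from being the next eviction victim.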

Checkpointing
• Checkpoint RDDs to prevent long lineage chains during fault recovery
• Simpler to checkpoint than shared memory
  – Read-only nature of RDDs
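The effect of a checkpoint on lineage length can be sketched as follows. This is a toy model with hypothetical names (`Stable`, `Derived`, `checkpoint`); "stable storage" is simulated by holding the materialized values in memory.

```scala
// Sketch: a checkpoint materializes the data and drops the lineage behind it.
// Hypothetical names; stable storage is simulated in memory.
object CheckpointSketch {
  sealed trait Node {
    def data(): Vector[Int]
    // Number of operations that must be replayed to rebuild this node.
    def depth: Int
  }

  final case class Stable(values: Vector[Int]) extends Node {
    def data(): Vector[Int] = values
    def depth: Int = 0
  }

  final case class Derived(parent: Node, f: Int => Int) extends Node {
    def data(): Vector[Int] = parent.data().map(f)
    def depth: Int = parent.depth + 1
  }

  // Safe to do without coordination because a node's contents never change.
  def checkpoint(n: Node): Node = Stable(n.data())
}
```

An iterative job that derives a new dataset on every iteration grows the chain by one step each time; checkpointing resets the replay cost to zero while leaving the contents unchanged.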

Discussion

Checkpointing or Versioning?
• Frequent checkpointing, or keep all versions of ranks?
