Spark (Marco Serafini, COMPSCI 532, Lecture 4)

Jul 01, 2023


Goals
• Support for iterative jobs
  • Reuse of intermediate results without writing to disk
• Lineage-based fault tolerance
  • Does not require checkpointing all intermediate results

Resilient Distributed Datasets (RDDs)
• Collection of records (serialized objects)
• Read-only
• Created through deterministic transformations from
  • Data in storage
  • Other RDDs
• Lineage of an RDD
  • Sequence of transformations that create it
  • Replayed (in parallel) from persisted data to recreate a lost RDD
• Caching an RDD: keeping it in memory for later reuse

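The lineage idea can be illustrated with a small pure-Python model. This is only a sketch: `ToyRDD`, `persist`, and `compute` are hypothetical names for this example, not Spark's API, and everything runs in one process on one list instead of on partitions.

```python
class ToyRDD:
    """Read-only dataset described by its lineage, not by its contents."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # data in stable storage (root datasets only)
        self.parent = parent  # dataset this one was derived from
        self.fn = fn          # deterministic transformation applied to the parent
        self.cache = None     # in-memory copy; can be lost on failure

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def persist(self):
        # Caching: keep the computed records in memory for later reuse.
        self.cache = self.compute()
        return self

    def compute(self):
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            return list(self.source)
        # Not cached: replay the lineage starting from the parent.
        return self.fn(self.parent.compute())


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).persist()
first = evens_squared.compute()

evens_squared.cache = None            # simulate losing the cached copy
recovered = evens_squared.compute()   # rebuilt by replaying filter, then map
print(first, recovered)
```

Because the transformations are deterministic and the source is still in storage, the recovered data matches the lost copy without any intermediate result having been written to disk.
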
Spark terminology
• Driver
  • Process executing the application code
  • Sends RDD transformations and actions to workers
• Workers
  • Host partitions of RDDs
  • Execute transformations

More terminology
• Task
  • Unit of work that will be sent to one executor
  • One partition's share of a fundamental operator, such as map or reduce
• Stage
  • Set of parallel tasks, one task per partition
  • Can include multiple pipelined tasks with no shuffling
• Job
  • Parallel computation consisting of multiple stages
  • Spawned in response to an action

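How a job breaks down into stages and tasks can be sketched as follows. This is a simplified model, not Spark's scheduler: `WIDE_OPS` and `split_into_stages` are hypothetical names, the job is assumed to be a linear chain of operators, and every stage is assumed to have the same number of partitions.

```python
# Operators assumed to need a shuffle in this toy model.
WIDE_OPS = {"groupByKey", "reduceByKey"}


def split_into_stages(ops, wide_ops=WIDE_OPS):
    """Cut a linear chain of operators into stages at shuffle boundaries."""
    stages, current = [], []
    for op in ops:
        # The input of a wide operator must be shuffled first, so the
        # operators pipelined so far form a complete stage.
        if op in wide_ops and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


job = ["map", "filter", "groupByKey", "mapValues"]
stages = split_into_stages(job)
num_partitions = 4
num_tasks = len(stages) * num_partitions  # one task per partition per stage
print(stages, num_tasks)
```

Here `map` and `filter` are pipelined in the first stage, and `groupByKey` starts the second stage after the shuffle.
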
More terminology
• Shuffle
  • Data transfer among workers
• Partition
  • Worker thread
• Transformation
  • Function that produces a new RDD but no output
  • Lazily evaluated
• Action
  • Function that returns output
  • Triggers evaluation

Spark API

Spark Computation
• Driver executes application code
  • Workers execute only transformations
  • Driver sends closure to workers
• Lazy evaluation
  • Driver records transformations without executing them
  • It builds a Directed Acyclic Graph (DAG) of transformations
  • Transformations execute only when an action requires output

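Lazy evaluation can be modeled in a few lines of plain Python. This is a sketch under simplifying assumptions: `LazyDataset` is a hypothetical class, it records a linear chain of operations instead of a full DAG, and `collect` stands in for an action.

```python
class LazyDataset:
    """Records transformations; runs them only when an action is called."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops  # recorded transformations, not yet executed

    def map(self, f):
        return LazyDataset(self._data, self._ops + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    def collect(self):
        # Action: apply the recorded transformations and return the output.
        rows = iter(self._data)
        for kind, f in self._ops:
            rows = map(f, rows) if kind == "map" else filter(f, rows)
        return list(rows)


calls = []


def traced_double(x):
    calls.append(x)  # remember every invocation
    return 2 * x


plan = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has run yet
result = plan.collect()           # the action triggers evaluation
print(calls_before_action, result, len(calls))
```

Building `plan` only records the two transformations; `traced_double` is first invoked inside `collect`.
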
Closures
• Driver serializes functions to be executed by workers
  • It computes a “closure” and sends it to the workers
  • Closure includes all objects referenced by the function
• Careful with references, or you might send huge closures

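The reference pitfall can be shown with ordinary Python closures. This sketch does not serialize anything: it inspects a function's closure cells as a stand-in for what a serializer would have to ship. `Lookup`, `captured_objects`, and the mapper names are hypothetical.

```python
def captured_objects(fn):
    """Return the objects a Python function holds in its closure cells."""
    return [cell.cell_contents for cell in (fn.__closure__ or ())]


class Lookup:
    def __init__(self, n):
        self.table = list(range(n))  # large field
        self.offset = 10             # small field

    def bad_mapper(self):
        # The lambda references self, so the whole object (including
        # the large table) becomes part of the closure.
        return lambda x: x + self.offset

    def good_mapper(self):
        offset = self.offset  # copy the small value into a local first
        return lambda x: x + offset


lookup = Lookup(100_000)
bad = lookup.bad_mapper()
good = lookup.good_mapper()

print(captured_objects(bad)[0] is lookup)  # the entire Lookup object is captured
print(captured_objects(good))              # only the small integer is captured
```

Both functions compute the same result, but only `good` keeps the large table out of the closure.
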
Narrow vs. Wide Operators
• Narrow dependency
  • Executes on data local to the same worker
  • Can be pipelined locally
  • Faster recovery (only local re-execution needed)
  • Example: map → filter
• Wide dependency
  • Requires a shuffle, which marks the end of a stage
  • Complex recovery (multi-worker)
  • Example: map → groupByKey (if the RDD is not already partitioned by that key)

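The difference between the two dependency types can be sketched with lists standing in for partitions. This is a toy model: `narrow_map_filter` and `group_by_key` are hypothetical helpers, keys are assumed to be non-negative integers, and the partitioner is simply `key % num_partitions`.

```python
from collections import defaultdict


def narrow_map_filter(partitions, f, pred):
    """map then filter, pipelined inside each partition; no record moves."""
    return [[y for y in map(f, part) if pred(y)] for part in partitions]


def group_by_key(partitions, num_partitions):
    """Wide: an output partition may need records from every input partition."""
    out = [defaultdict(list) for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = key % num_partitions  # toy partitioner for integer keys
            if dst != src:
                moved += 1              # record crosses partitions (shuffle)
            out[dst][key].append(value)
    return [dict(d) for d in out], moved


local = narrow_map_filter([[1, 2], [3, 4]], lambda x: x * 10, lambda y: y > 10)

unpartitioned = [[(0, "a"), (1, "b")], [(0, "c"), (1, "d")]]
grouped, moved = group_by_key(unpartitioned, 2)

prepartitioned = [[(0, "a"), (0, "c")], [(1, "b"), (1, "d")]]
grouped_pre, moved_pre = group_by_key(prepartitioned, 2)
print(local, grouped, moved, moved_pre)
```

When the input is already partitioned by the grouping key, no record has to move, which is why partitioning by that key beforehand avoids the shuffle.
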
Checkpoints and Partitioning
• Partitioning function is declared when the RDD is created
• Checkpointing
  • Speeds up recovery by shortening the lineage that must be replayed
  • When to checkpoint is left to the user
  • Checkpoint is stored (replicated) on the file system

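The effect of a checkpoint on recovery can be sketched by counting how many transformations have to be replayed. This is a hypothetical model: `recover` is not a Spark function, the lineage is a plain list of per-record functions, and a checkpoint is represented as a pair of (number of transformations already applied, saved data).

```python
def recover(lineage, source, checkpoint=None):
    """Rebuild a lost dataset; return (data, transformations re-executed)."""
    # Without a checkpoint, replay starts from the source data.
    start, data = (0, source) if checkpoint is None else checkpoint
    for fn in lineage[start:]:
        data = [fn(x) for x in data]
    return data, len(lineage) - start


lineage = [lambda x: x + 1] * 10  # ten chained map steps

full_data, full_cost = recover(lineage, [0, 1])

# Suppose the user checkpointed the result of the first 8 transformations.
saved, _ = recover(lineage[:8], [0, 1])
fast_data, fast_cost = recover(lineage, [0, 1], checkpoint=(8, saved))
print(full_cost, fast_cost)
```

Both recoveries produce the same data, but the checkpointed one replays only the transformations recorded after the checkpoint.
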
PageRank Example

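The code from this slide is not part of the extracted text. As a stand-in, here is a pure-Python sketch of the iterative PageRank formulation commonly used in Spark examples, with comments naming the RDD operator each step would correspond to. The damping factor 0.85, the initial rank 1.0, and the function name `pagerank` are assumptions, and every page is assumed to appear as a key in `links`.

```python
def pagerank(links, iterations=10, damping=0.85):
    """links maps each page to the list of pages it links to."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, neighbors in links.items():  # join links with ranks
            share = ranks[page] / len(neighbors)
            for dest in neighbors:             # flatMap: one contribution per link
                contribs[dest] += share        # reduceByKey: sum per destination
        # mapValues: combine the damped sum with the constant term
        ranks = {p: (1 - damping) + damping * c for p, c in contribs.items()}
    return ranks


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```

Each iteration reuses `links` unchanged while `ranks` is replaced, which is the access pattern that benefits from caching and from keeping both datasets partitioned by page.
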
Questions
• Which intermediate RDDs are created?
• Stages? Narrow/wide operators?
• How to reduce shuffling?
• Lineage graph

Lineage Graph
