Lecture 16.1: Spark and RDDs
EN 600.320/420
Instructor: Randal Burns
9 April 2018
Department of Computer Science, Johns Hopkins University
Lecture 16: Spark and RDDs
Spark: Batch Computing Reload

Map/Reduce-style programming
– Data-parallel, batch, restrictive model, functional
– Abstractions to leverage distributed memory

New interfaces to in-memory computations
– Fault-tolerant
– Lazy materialization (pipelined evaluation)

Good support for iterative computations on in-memory data sets leads to good performance
– Up to 20x speedup over Map/Reduce
– No writing data to the file system and reloading it between jobs
Lecture derived from: Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." USENIX NSDI, 2012.
RDD: Resilient Distributed Dataset

Read-only, partitioned collection of records
Created from:
– Data in stable storage
– Transformations on other RDDs

Unit of parallelism in a data decomposition:
– Automatic parallelization of transformations, such as map, reduce, filter, etc.

RDDs are not data:
– Not materialized; they are an abstraction
– Defined by lineage: the set of transformations applied to an original data set
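The "defined by lineage, not materialized" idea can be illustrated with a small plain-Python sketch. This is an analogy of the concept, not Spark's implementation or API; the class name `LineageRDD` and its methods are hypothetical.

```python
# Illustrative analogy only: a tiny "RDD" that stores its lineage
# (parent + transformation) instead of data, and materializes lazily.
class LineageRDD:
    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # stable storage (here: a plain list)
        self.parent = parent        # the RDD this one was derived from
        self.transform = transform  # function applied to the parent's records

    def map(self, f):
        # Returns a new RDD; nothing is computed yet (lazy).
        return LineageRDD(parent=self, transform=lambda recs: [f(r) for r in recs])

    def filter(self, pred):
        return LineageRDD(parent=self, transform=lambda recs: [r for r in recs if pred(r)])

    def collect(self):
        # An action: walk the lineage back to stable storage and evaluate.
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.collect())

base = LineageRDD(source=[1, 2, 3, 4])
doubled = base.map(lambda x: x * 2)      # just lineage, no data yet
evens = doubled.filter(lambda x: x > 4)  # still no data
print(evens.collect())                   # [6, 8]
```

Note that `doubled` and `evens` hold no records at all until `collect()` is called; losing either object loses nothing that cannot be recomputed from `base`, which is the essence of lineage-based fault tolerance.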
RDD Lineage

(Slide figure: a lineage graph)
– lines: backed by a file in HDFS
– errors: filtered lines
– time fields: collect() makes real data
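The lineage in the figure follows the log-mining example from the Zaharia et al. paper: a base RDD of lines, a filtered errors RDD, and an action that finally materializes data. A hedged plain-Python analogue, with generators standing in for RDDs (the log lines and their format are made up):

```python
# Analogy in plain Python: generator expressions are lazy like RDD
# transformations, and list() plays the role of a collect() action.
log_lines = [
    "INFO starting up",
    "ERROR disk failure at 12:01",
    "INFO heartbeat",
    "ERROR timeout at 12:07",
]

lines = iter(log_lines)                                   # "backed by storage"
errors = (ln for ln in lines if ln.startswith("ERROR"))   # lazy filter: no work yet
times = (ln.split(" at ")[1] for ln in errors)            # lazy map: still no work

collected = list(times)  # the "action": only now do records flow through
print(collected)         # ['12:01', '12:07']
```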
Logistic Regression: A First Example

Features:
– Scala closures (on w): functions with free variables
– points is a read-only RDD, reused in each iteration
– Only w (the weights) gets updated
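The pattern the slide describes (map a gradient over the read-only points, reduce by summing, update only w) can be sketched in plain Python; this is a minimal 1-D analogy with made-up data and step count, not the Scala/Spark code from the paper.

```python
import math

# Read-only "points": (x, y) pairs with labels y in {-1, +1}.
points = [(1.0, 1), (2.0, 1), (-1.0, -1), (-2.0, -1)]

w = 0.0
for _ in range(10):
    # "map" each point to its gradient contribution, then "reduce" by summing.
    gradient = sum((1.0 / (1.0 + math.exp(-y * w * x)) - 1.0) * y * x
                   for x, y in points)
    w -= gradient  # only the free variable w changes between iterations

print(w > 0)  # the learned weight separates the data: True
```

In Spark, `points` would be a persisted RDD so each iteration's map runs over in-memory data, while the closure captures the current value of `w`.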
Managing the State of Data

persist(): indicates a desire to reuse an RDD; encourages Spark to keep it in memory

RDD: the representation of a logical data set
sequence: a physical, materialized data set

In Spark-land, RDDs and sequences are differentiated by the concepts of:
– Transformations: RDD -> RDD
– Actions: RDD -> sequence/data

RDDs define a pipeline of computations from the data set (HDFS) to a sequence/data:
– RDDs are evaluated lazily, as needed to build a sequence
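What persist() buys you can be sketched in plain Python (hypothetical names, not Spark's API): without it, every action re-runs the pipeline from the source; a persisted result is computed once and reused.

```python
# Analogy only: count how many times the "pipeline" actually runs.
evaluations = {"count": 0}

def expensive_transform(records):
    evaluations["count"] += 1      # stands in for re-reading HDFS + recomputing
    return [r * r for r in records]

source = [1, 2, 3]

# Without persist: each action re-evaluates the lineage from the source.
a = expensive_transform(source)    # action 1
b = expensive_transform(source)    # action 2
print(evaluations["count"])        # 2

# With "persist": materialize once, then reuse the in-memory result.
evaluations["count"] = 0
persisted = expensive_transform(source)  # computed once, kept in memory
c = list(persisted)                      # action 1 reuses it
d = list(persisted)                      # action 2 reuses it
print(evaluations["count"])              # 1
```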
Transformations and Actions

Parallelized constructs in Spark
– Transformations are lazy, whereas actions launch computation
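The lazy/eager split can be observed with Python's own lazy `map` (again an analogy, not Spark): building the transformation does no work, and only an action-like `list()` forces the computation to run.

```python
# Analogy: a Python map() object is lazy, like a Spark transformation.
touched = []

def tag(x):
    touched.append(x)   # side effect lets us observe when work happens
    return x + 1

pipeline = map(tag, [1, 2, 3])  # "transformation": nothing has run yet
print(touched)                  # []

result = list(pipeline)         # "action": now the computation launches
print(result)                   # [2, 3, 4]
print(touched)                  # [1, 2, 3]
```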