Spark and Friends
Presented by: Jeff Rasley & John Meehan

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
UC Berkeley, AMP Lab
NSDI 2012
Presented by: Jeff Rasley

Outline
• Motivation
• Resilient Distributed Datasets
• Implementation
• Examples
• Performance
• Discussion
• Summary
• Demo

Motivation
Slow due to replication; however, replication is required for fault tolerance

Resilient Distributed Datasets (RDDs)
Significantly faster, but what about fault tolerance?

RDDs: Fault Tolerance
• We could replicate data and/or logs across the cluster
  o Expensive!
  o These systems exist for fine-grained updates
    § RAMCloud, distributed memory, Piccolo, databases, etc.
• Instead, only allow coarse-grained updates
  o Log deterministic transformation operations
    § map, join, filter, etc.
  o Fault recovery by replaying update lineage
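The coarse-grained logging idea above can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's implementation: the class LineageDataset and its methods are hypothetical names, and a list of lists stands in for partitions spread across a cluster.

```python
from typing import Callable, List, Optional


class LineageDataset:
    """Toy partitioned dataset that logs transformations instead of replicating data.

    Hypothetical illustration only; this is not Spark's API.
    """

    def __init__(self, source: List[List[int]], lineage=None):
        self.source = source          # stable input, one list per partition
        self.lineage = lineage or []  # ordered log of (kind, function) pairs
        self.cache: List[Optional[List[int]]] = [None] * len(source)

    def map(self, fn: Callable[[int], int]) -> "LineageDataset":
        # Coarse-grained update: log the function, do not touch the data yet.
        return LineageDataset(self.source, self.lineage + [("map", fn)])

    def filter(self, fn: Callable[[int], bool]) -> "LineageDataset":
        return LineageDataset(self.source, self.lineage + [("filter", fn)])

    def compute_partition(self, index: int) -> List[int]:
        """Replay the logged transformations on one partition of the source."""
        data = list(self.source[index])
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def materialize(self) -> None:
        """Compute every partition and keep the results in memory."""
        self.cache = [self.compute_partition(i) for i in range(len(self.source))]

    def lose_partition(self, index: int) -> None:
        """Simulate a machine failure that drops one in-memory partition."""
        self.cache[index] = None

    def get_partition(self, index: int) -> List[int]:
        # Only the lost partition is rebuilt; the others stay cached.
        if self.cache[index] is None:
            self.cache[index] = self.compute_partition(index)
        return self.cache[index]


source = [[1, 2, 3], [4, 5, 6]]
squares_of_evens = (LineageDataset(source)
                    .filter(lambda x: x % 2 == 0)
                    .map(lambda x: x * x))
squares_of_evens.materialize()
squares_of_evens.lose_partition(1)             # "kill" the machine holding partition 1
recovered = squares_of_evens.get_partition(1)  # rebuilt by replaying the lineage
```

Because the logged functions are deterministic, replaying them on the same source partition yields the same data that was lost, so no replica of the derived data is needed.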

Tradeoffs
RDDs vs. HDFS vs. K-V stores

Implementation - Apache Spark
• Spark is an actual implementation of RDDs
• Works with the Scala interpreter
  o Great for interactive queries!
• Open source: spark.incubator.apache.org
• Reads data from HDFS or AWS S3
• Uses: Spam Classification, DNA Sequencing, Interactive Data Mining

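Example - Console Log Mining
val lines = spark.textFile("hdfs://...")              // base RDD read from HDFS
val errors = lines.filter(_.startsWith("ERROR"))      // transformation; _.startsWith("ERROR") is a closure
errors.persist()                                      // keep errors in memory for reuse
errors.filter(_.contains("HDFS"))                     // transformation
  .map(_.split('\t')(3))                              // transformation: take field 3 of each line
  .collect()                                          // action: triggers the computation
Color Key: Transformation, Action, Closure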
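The split between transformations and actions in the example above can be mimicked in plain Python. This is a rough sketch under stated assumptions, not Spark code: LazyDataset is a hypothetical stand-in for an RDD, and the log lines are made-up sample data. Transformations only record work; the input is scanned when an action such as collect() runs, and persist() keeps the filtered result for later queries.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces a list
        self._persist = False
        self._cached = None

    def filter(self, predicate):
        # Transformation: returns a new lazy dataset, nothing is computed here.
        return LazyDataset(lambda: [x for x in self._data() if predicate(x)])

    def map(self, fn):
        # Transformation: also lazy.
        return LazyDataset(lambda: [fn(x) for x in self._data()])

    def persist(self):
        # Mark the dataset to be kept in memory after its first computation.
        self._persist = True
        return self

    def _data(self):
        if self._persist:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()

    def collect(self):
        # Action: forces the recorded transformations to run.
        return list(self._data())


reads = []  # one entry per scan of the raw input


def read_log():
    reads.append(1)
    # Hypothetical tab-separated log lines, invented for this sketch.
    return [
        "ERROR\t2012-04-25\tnode7\tHDFS write failed",
        "INFO\t2012-04-25\tnode3\theartbeat",
        "ERROR\t2012-04-25\tnode2\tdisk full",
        "ERROR\t2012-04-26\tnode7\tHDFS replica missing",
    ]


lines = LazyDataset(read_log)
errors = lines.filter(lambda line: line.startswith("ERROR"))
errors.persist()
scans_before_action = len(reads)   # still 0: nothing has run yet

hdfs_messages = (errors.filter(lambda line: "HDFS" in line)
                 .map(lambda line: line.split("\t")[3])
                 .collect())
error_count = len(errors.collect())  # second query reuses the persisted errors
```

After both queries the raw input has been scanned only once, which is the benefit the persist() call is meant to show.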

Spark Operations

Spark: Job Stages
Each stage is scheduled as a task in a pipeline to produce the final results automatically by the job scheduler.
Key
Shaded boxes - RDDs
Shaded outlines - Partitions
Arrows - Data transfer between RDDs

Failure Graph
Iteration times for k-means in the presence of a failure. One machine was killed at the start of the 6th iteration, resulting in partial reconstruction of an RDD using lineage.

Performance vs Hadoop
HadoopBinMem: A Hadoop deployment that converts the input data into a low-overhead binary format in the first iteration to eliminate text parsing in later ones, and stores it in an in-memory HDFS instance.

Performance vs RAM Size
Iteration times for logistic regression using 100 GB data on 25 machines with varying amounts of data in memory.
Spills data to disk or re-computes the partitions that don't fit in RAM each time they are requested
(Figure annotation: Entirely on disk)
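The recompute-on-request behavior described above can be illustrated with a small Python sketch. It is an assumption-laden toy, not Spark's memory manager: PartialCache, run_iterations and the capacity counted in whole partitions are all hypothetical, and spilling to disk is not modeled. Partitions that fit stay cached; the rest are computed again every time an iteration asks for them.

```python
class PartialCache:
    """Keeps at most `capacity` partitions in memory; the rest are recomputed."""

    def __init__(self, compute_partition, num_partitions, capacity):
        self.compute_partition = compute_partition
        self.num_partitions = num_partitions
        self.capacity = capacity
        self.cached = {}
        self.computations = 0   # how many times any partition was computed

    def get(self, index):
        if index in self.cached:
            return self.cached[index]
        self.computations += 1
        data = self.compute_partition(index)
        if len(self.cached) < self.capacity:
            self.cached[index] = data   # fits in "RAM", keep it
        return data                     # otherwise it is dropped after use


def run_iterations(cache, iterations):
    """Scan every partition once per iteration, as an iterative job would."""
    total = 0
    for _ in range(iterations):
        for index in range(cache.num_partitions):
            total += sum(cache.get(index))
    return total


def make_partition(index):
    return [index, index + 1]


all_in_memory = PartialCache(make_partition, num_partitions=4, capacity=4)
half_in_memory = PartialCache(make_partition, num_partitions=4, capacity=2)
none_in_memory = PartialCache(make_partition, num_partitions=4, capacity=0)

totals = [run_iterations(c, 3)
          for c in (all_in_memory, half_in_memory, none_in_memory)]
work = [c.computations
        for c in (all_in_memory, half_in_memory, none_in_memory)]
```

All three configurations return the same result; only the amount of repeated work changes as less of the dataset fits in memory.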

Discussion
• RDDs can express numerous systems:
  o MapReduce
  o DryadLINQ
  o Hive/SQL (Shark)
  o Pregel (200 LOC)
  o Iterative MapReduce (200 LOC)
    § e.g. HaLoop

Pros
• Expressive
• Good for batch queries
• Minimize Disk I/O
• Fast, good for...
  o iterative applications
  o interactive queries
• Fault-tolerant
• Open-source
Cons
• Works best when total RAM size > RDD sizes
  o Unclear how performance scales over 1TB data sets
• Nondeterministic functions are not supported
• Doesn't work with asynchronous fine-grained updates
  o e.g. an incremental web crawler

Take-away: Hadoop vs Spark
• Hadoop
  o (+) Good for batch jobs of arbitrary map/reduce functions (supports non-determinism)
  o (-) Very coarse data transformation model
  o (+) Highly supported, numerous resources available
    § Probably the reason it has so much momentum
• Spark
  o (+) Good for iterative jobs with deterministic transformations
  o (+) Supports more transformations than M/R
  o (-) Relatively new, less support. Gaining traction

Demo
5-minute demo of Matei doing some queries on the Wikipedia dataset on an EC2 cluster, from NSDI '12

Discretized Streams
An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
UC Berkeley, AMP Lab
HotCloud 2012
Presented by: John Meehan

Stream Processing
• Continuous queries on changing dataset
• High-velocity datasets
• Push-based system
• Streaming datasets
  o Stock tickers
  o Social media data (Twitter)
  o Sensor data
• Modern distributed stream systems
  o Yahoo!'s S4
  o Twitter's Storm

Streaming Example
SELECT MIN(VALUE)
FROM WINDOW(TICKER, 3 TUPLES)
[Diagram: Data Flow -> Window (3 tuples) -> Data Flow]

Streaming Example
SELECT MIN(VALUE)
FROM WINDOW(TICKER, 3 TUPLES)
[Diagram: Data Flow -> Window (3 tuples) -> Data Flow, with a MINIMUM operator producing the Query Output]

Streaming Example
SELECT MIN(VALUE)
FROM WINDOW(TICKER, 3 TUPLES)
[Diagram: Data Flow -> Window (3 tuples) -> Data Flow, with a MINIMUM operator producing the Query Output]
Incoming tuples:
TICKER  VALUE
MSFT    $70.28
MSFT    $70.84
MSFT    $70.55

Streaming Example
Window (3 tuples):
TICKER  VALUE
MSFT    $70.28
MSFT    $70.84
MSFT    $70.55
MINIMUM Query Output: $70.28

Streaming Example
Window (3 tuples):
TICKER  VALUE
MSFT    $70.28
MSFT    $70.84
MSFT    $70.55
Incoming tuple:
TICKER  VALUE
MSFT    $70.43
MINIMUM Query Output: $70.28

Streaming Example
Window (3 tuples):
TICKER  VALUE
MSFT    $70.84
MSFT    $70.55
MSFT    $70.43
MINIMUM Query Output: $70.43
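The windowed MIN query from these slides can be sketched in Python. The function name sliding_min is hypothetical, and the sketch assumes the query emits one result for each full window of three tuples; the prices are the MSFT values shown on the slides.

```python
from collections import deque


def sliding_min(values, window_size):
    """Emit the minimum of the last `window_size` values once the window is full."""
    window = deque(maxlen=window_size)  # the oldest tuple is evicted automatically
    outputs = []
    for value in values:
        window.append(value)
        if len(window) == window_size:
            outputs.append(min(window))
    return outputs


prices = [70.28, 70.84, 70.55, 70.43]  # MSFT values in arrival order
minimums = sliding_min(prices, 3)
```

The first full window gives 70.28; when 70.43 arrives, 70.28 slides out of the window and the output becomes 70.43, matching the last two slides.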

Cloud Distribution Challenges
• Consistency
  o Global state difficult to achieve
• Fault tolerance
  o Replication and upstream backup
  o Slow and expensive
• Unification of batch processing
  o Event-driven systems require separate API
  o Difficult to combine streaming with historical data

D-Streams: Discretized Streams
• Built on Spark (aka Spark Streaming)
• Treats streaming computations as a series of deterministic batch computations
• Tuples are divided into small time intervals
• Parallelizable operations transform input data
• Major advantages
  o Consistency is well-defined
  o Processing model is easy to unify with batch systems

Waits for time interval, collecting tuples
[Diagram: INPUT DATA -> PARALLELIZABLE TRANSFORMATIONS -> OUTPUT DATA]
Tuples arriving individually:
TICKER  VALUE
MSFT    $70.28
APPL    $104.38
GOOG    $89.33

Sends all tuples as a batch
[Diagram: INPUT DATA -> PARALLELIZABLE TRANSFORMATIONS -> OUTPUT DATA]
Batch:
TICKER  VALUE
MSFT    $70.28
APPL    $104.38
GOOG    $89.33
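The collect-then-batch behavior shown on these two slides can be sketched in Python. This is a simplified illustration, not the Spark Streaming API: discretize and min_per_ticker are hypothetical helpers, the timestamps are invented for the example, and intervals with no tuples are simply skipped.

```python
from collections import defaultdict


def discretize(tuples, interval):
    """Group (timestamp, ticker, value) tuples into consecutive time intervals."""
    batches = defaultdict(list)
    for timestamp, ticker, value in tuples:
        batches[int(timestamp // interval)].append((ticker, value))
    # One batch per interval that received tuples, in time order.
    return [batches[key] for key in sorted(batches)]


def min_per_ticker(batch):
    """Deterministic batch computation applied to each interval's tuples."""
    result = {}
    for ticker, value in batch:
        result[ticker] = min(value, result.get(ticker, value))
    return result


# Ticker values from the slides; the timestamps (in seconds) are assumptions.
stream = [
    (0.2, "MSFT", 70.28),
    (0.5, "APPL", 104.38),
    (0.9, "GOOG", 89.33),
    (1.1, "MSFT", 70.84),
]

batches = discretize(stream, interval=1.0)
results = [min_per_ticker(batch) for batch in batches]
```

Each interval is processed as an ordinary batch job, so the same function could be run unchanged over historical data.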

Low latency in a batch system
• Traditional batch systems (Hadoop): store intermediate state on disk
  o Tens of seconds latency…
  o Too slow for streaming
• Key-value store expensive due to replication
• Solution: RDDs
  o Keeps state in memory
  o Allows for inexpensive parallel recovery
