

SLIDE 1

Making Big Data Processing Simple with Spark

Matei Zaharia

December 17, 2015

SLIDE 2

What is Apache Spark?

Fast and general cluster computing engine that generalizes the MapReduce model

Makes it easy and fast to process large datasets:

  • High-level APIs in Java, Scala, Python, R
  • Unified engine that can capture many workloads
SLIDE 3

A Unified Engine

[Diagram: libraries built on the Spark core engine]

  • Spark Streaming (real-time)
  • Spark SQL (structured data)
  • MLlib (machine learning)
  • GraphX (graph)

SLIDE 4

A Large Community

[Chart: Contributors / Month to Spark, 2010-2015]

Most active open source project for big data

SLIDE 5

Overview

  • Why a unified engine?
  • Spark programming model
  • Built-in libraries
  • Applications

SLIDE 6

History: Cluster Computing

2004

SLIDE 7

MapReduce

A general engine for batch processing

SLIDE 8

Beyond MapReduce

MapReduce was great for batch processing, but users quickly needed to do more:

  • More complex, multi-pass algorithms
  • More interactive ad-hoc queries
  • More real-time stream processing

Result: specialized systems for these workloads

SLIDE 9

Big Data Systems Today

  • General batch processing: MapReduce
  • Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, . . .

SLIDE 10

Problems with Specialized Systems

More systems to manage, tune, deploy

Can't easily combine processing types

  • Even though most applications need to do this!
  • E.g. load data with SQL, then run machine learning

In many cases, data transfer between engines is a dominant cost!

SLIDE 11

Big Data Systems Today

  • General batch processing: MapReduce
  • Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, . . .
  • Unified engine: ?

SLIDE 12

Overview

  • Why a unified engine?
  • Spark programming model
  • Built-in libraries
  • Applications

SLIDE 13

Background

Recall the 3 workloads that were problematic for MapReduce:

  • More complex, multi-pass algorithms
  • More interactive ad-hoc queries
  • More real-time stream processing

While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing

SLIDE 14

Data Sharing in MapReduce

[Diagram: iterative job: HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .]

[Diagram: interactive queries: query 1, query 2, query 3, . . . each re-read the same input from HDFS to produce result 1, result 2, result 3]

Slow due to replication and disk I/O

SLIDE 15
What We'd Like

[Diagram: iterative job: one-time HDFS read of the input, then iter. 1, iter. 2, . . . share data through distributed memory]

[Diagram: interactive queries: one-time processing loads the input into distributed memory, then query 1, query 2, query 3, . . . run against it]

Memory is 10-100x faster than network and disk

SLIDE 16

Spark Programming Model

Resilient Distributed Datasets (RDDs)

  • Collections of objects stored in RAM or disk across cluster
  • Built via parallel transformations (map, filter, …)
  • Automatically rebuilt on failure
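
A minimal sketch of this model in PySpark (not from the slides; the app name, data, and sizes are illustrative assumptions):

from pyspark import SparkContext

sc = SparkContext(appName="RDDSketch")  # hypothetical app name

# Build an RDD, then derive new ones via parallel transformations.
nums = sc.parallelize(range(1, 1001))         # collection partitioned across the cluster
evens = nums.map(lambda x: x * x) \
            .filter(lambda x: x % 2 == 0)     # transformations are lazy

evens.cache()          # keep this RDD in RAM across the cluster
print(evens.count())   # an action triggers computation; lost partitions are rebuilt
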
SLIDE 17

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

[Diagram: the driver ships tasks to three workers; each worker reads one HDFS block (Block 1-3), builds its partition of messages, caches it in memory (Cache 1-3), and returns results to the driver]

messages.filter(lambda s: "MySQL" in s).count()          # Action
messages.filter(lambda s: "Redis" in s).count()
. . .

Example: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)

SLIDE 18

Fault Tolerance

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda (type, count): count > 10)

[Diagram: lineage graph: Input file → map → reduceByKey → filter]

RDDs track lineage info to rebuild lost data
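
The chained lambda above uses Python 2 tuple unpacking, which Python 3 removed. A runnable Python 3 sketch of the same lineage, assuming tab-separated text records with the type in the first field and a hypothetical input path:

from pyspark import SparkContext

sc = SparkContext(appName="LineageSketch")    # hypothetical app name

file = sc.textFile("hdfs://.../records.txt")  # hypothetical path
counts = (file.map(lambda rec: (rec.split('\t')[0], 1))
              .reduceByKey(lambda x, y: x + y)
              .filter(lambda tc: tc[1] > 10)) # index instead of tuple unpacking

# If a partition of `counts` is lost, Spark re-runs just the map,
# reduceByKey, and filter steps for that partition from the lineage graph.
print(counts.take(5))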

SLIDE 20

Example: Logistic Regression

[Chart: running time (s) vs. number of iterations (1-30), Hadoop vs. Spark]

Hadoop: 110 s / iteration
Spark: 80 s first iteration, 1 s further iterations
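
This pattern comes from caching the training data: a sketch of a logistic regression loop in PySpark that behaves this way, assuming labels of ±1 followed by D features per line and a hypothetical input path:

import numpy as np
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="LogRegSketch")  # hypothetical app name
D, ITERATIONS = 10, 20                     # assumed feature count and iteration count

def parse_point(line):
    vals = np.array(line.split(), dtype=float)  # label, then D features
    return vals[0], vals[1:]

# cache() keeps the parsed points in memory, so only the first
# iteration pays the cost of reading and parsing the input.
points = sc.textFile("hdfs://.../points.txt").map(parse_point).cache()

w = np.zeros(D)  # weight vector, updated on the driver each iteration
for i in range(ITERATIONS):
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(add)
    w -= gradient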

SLIDE 21

On-Disk Performance

Time to sort 100 TB (source: Daytona GraySort benchmark, sortbenchmark.org)

  • 2013 record: Hadoop, 2100 machines, 72 minutes
  • 2014 record: Spark, 207 machines, 23 minutes

SLIDE 22

Libraries Built on Spark

[Diagram: libraries built on the Spark core engine]

  • Spark Streaming (real-time)
  • Spark SQL (structured data)
  • MLlib (machine learning)
  • GraphX (graph)

SLIDE 23

Combining Processing Types

// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
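
A runnable sketch of the first two steps in 1.x-era PySpark (the registered "tweets" table is an assumption, and the streaming step is omitted because the Twitter source required an external connector):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="CombinedSketch")  # hypothetical app name
ctx = SQLContext(sc)

# Load data using SQL (assumes a table named "tweets" is registered).
points = (ctx.sql("SELECT latitude, longitude FROM tweets")
             .rdd.map(lambda row: [row.latitude, row.longitude]))

# Train a machine learning model on the SQL output.
model = KMeans.train(points, 10)
print(model.clusterCenters)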

SLIDE 24

Combining Processing Types

Separate systems:

HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → . . .

Spark:

HDFS read → ETL → train → query → HDFS write

SLIDE 25

Performance vs Specialized Systems

[Chart: SQL: response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem)]

[Chart: ML: response time (min) for Mahout, GraphLab, Spark]

[Chart: Streaming: throughput (MB/s/node) for Storm, Spark]

SLIDE 26

Some Recent Additions

DataFrame API (similar to R and Pandas)

  • Easy programmatic way to work with structured data

R interface (SparkR)

Machine learning pipelines (like scikit-learn)
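
A small sketch of the DataFrame API in PySpark (the JSON path and column names are illustrative assumptions):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="DataFrameSketch")  # hypothetical app name
ctx = SQLContext(sc)

# Load structured data; Spark infers the schema from the JSON.
df = ctx.read.json("hdfs://.../people.json")  # hypothetical path

# R/Pandas-like operations, compiled and optimized by Spark's query planner.
df.filter(df.age > 21).groupBy("country").count().show()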

SLIDE 27

Overview

  • Why a unified engine?
  • Spark programming model
  • Built-in libraries
  • Applications

SLIDE 28

Spark Community

Over 1000 deployments, clusters up to 8000 nodes

Many talks online at spark-summit.org

SLIDE 29

Top Applications

  • Business Intelligence: 68%
  • Data Warehousing: 52%
  • Recommendation: 44%
  • Log Processing: 40%
  • User-Facing Services: 36%
  • Fraud Detection / Security: 29%

SLIDE 30

Spark Components Used

  • Spark SQL: 69%
  • DataFrames: 62%
  • Spark Streaming: 58%
  • MLlib + GraphX: 58%

75% of users use more than one component

SLIDE 31

Learn More

  • Get started on your laptop: spark.apache.org
  • Resources and MOOCs: sparkhub.databricks.com
  • Spark Summit: spark-summit.org

SLIDE 32

Thank You