

SLIDE 1

Making Big Data Processing Simple with Spark

Matei Zaharia

December 17, 2015

SLIDE 2

What is Apache Spark?

Fast and general cluster computing engine that generalizes the MapReduce model

Makes it easy and fast to process large datasets:

  • High-level APIs in Java, Scala, Python, R
  • Unified engine that can capture many workloads
SLIDE 3

A Unified Engine

[Diagram: libraries built on the Spark core engine]

  • Spark Streaming (real-time)
  • Spark SQL (structured data)
  • MLlib (machine learning)
  • GraphX (graph)

SLIDE 4

A Large Community

[Chart: Contributors / Month to Spark, 2010-2015]

Most active open source project for big data

SLIDE 5

Overview

  • Why a unified engine?
  • Spark programming model
  • Built-in libraries
  • Applications

SLIDE 6

History: Cluster Computing

2004

SLIDE 7

MapReduce

A general engine for batch processing

SLIDE 8

Beyond MapReduce

MapReduce was great for batch processing, but users quickly needed to do more:

  • More complex, multi-pass algorithms
  • More interactive ad-hoc queries
  • More real-time stream processing

Result: specialized systems for these workloads

SLIDE 9

Big Data Systems Today

  • General batch processing: MapReduce
  • Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, . . .

SLIDE 10

Problems with Specialized Systems

More systems to manage, tune, deploy

Can't easily combine processing types

  • Even though most applications need to do this!
  • E.g. load data with SQL, then run machine learning

In many cases, data transfer between engines is a dominant cost!

SLIDE 11

Big Data Systems Today

  • General batch processing: MapReduce
  • Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, . . .
  • Unified engine: ?

SLIDE 12

Overview

  • Why a unified engine?
  • Spark programming model
  • Built-in libraries
  • Applications

SLIDE 13

Background

Recall the 3 workloads that were problematic for MapReduce:

  • More complex, multi-pass algorithms
  • More interactive ad-hoc queries
  • More real-time stream processing

While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing

SLIDE 14

Data Sharing in MapReduce

[Diagram: iterative job: HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .]

[Diagram: interactive queries: query 1, query 2, query 3, . . . each re-read the same input from HDFS to produce result 1, result 2, result 3]

Slow due to replication and disk I/O

SLIDE 15
What We'd Like

[Diagram: iterative job: one-time HDFS read of the input, then iter. 1, iter. 2, . . . share data through distributed memory]

[Diagram: interactive queries: one-time processing loads the input into distributed memory, then query 1, query 2, query 3, . . . run against it]

Memory is 10-100x faster than network and disk

SLIDE 16

Spark Programming Model

Resilient Distributed Datasets (RDDs)

  • Collections of objects stored in RAM or disk across cluster
  • Built via parallel transformations (map, filter, …)
  • Automatically rebuilt on failure
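
A minimal sketch of this model in PySpark (not from the slides; the app name, data, and sizes are illustrative assumptions):

from pyspark import SparkContext

sc = SparkContext(appName="RDDSketch")  # hypothetical app name

# Build an RDD, then derive new ones via parallel transformations.
nums = sc.parallelize(range(1, 1001))         # collection partitioned across the cluster
evens = nums.map(lambda x: x * x) \
            .filter(lambda x: x % 2 == 0)     # transformations are lazy

evens.cache()          # keep this RDD in RAM across the cluster
print(evens.count())   # an action triggers computation; lost partitions are rebuilt
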
SLIDE 17

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

[Diagram: the driver ships tasks to three workers; each worker reads one HDFS block (Block 1-3), builds its partition of messages, caches it in memory (Cache 1-3), and returns results to the driver]

messages.filter(lambda s: "MySQL" in s).count()          # Action
messages.filter(lambda s: "Redis" in s).count()
. . .

Example: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)

SLIDE 18

Fault Tolerance

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda (type, count): count > 10)

[Diagram: lineage graph: Input file → map → reduceByKey → filter]

RDDs track lineage info to rebuild lost data
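
The chained lambda above uses Python 2 tuple unpacking, which Python 3 removed. A runnable Python 3 sketch of the same lineage, assuming tab-separated text records with the type in the first field and a hypothetical input path:

from pyspark import SparkContext

sc = SparkContext(appName="LineageSketch")    # hypothetical app name

file = sc.textFile("hdfs://.../records.txt")  # hypothetical path
counts = (file.map(lambda rec: (rec.split('\t')[0], 1))
              .reduceByKey(lambda x, y: x + y)
              .filter(lambda tc: tc[1] > 10)) # index instead of tuple unpacking

# If a partition of `counts` is lost, Spark re-runs just the map,
# reduceByKey, and filter steps for that partition from the lineage graph.
print(counts.take(5))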

SLIDE 20

Example: Logistic Regression

[Chart: running time (s) vs. number of iterations (1-30), Hadoop vs. Spark]

Hadoop: 110 s / iteration
Spark: 80 s first iteration, 1 s further iterations
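
This pattern comes from caching the training data: a sketch of a logistic regression loop in PySpark that behaves this way, assuming labels of ±1 followed by D features per line and a hypothetical input path:

import numpy as np
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="LogRegSketch")  # hypothetical app name
D, ITERATIONS = 10, 20                     # assumed feature count and iteration count

def parse_point(line):
    vals = np.array(line.split(), dtype=float)  # label, then D features
    return vals[0], vals[1:]

# cache() keeps the parsed points in memory, so only the first
# iteration pays the cost of reading and parsing the input.
points = sc.textFile("hdfs://.../points.txt").map(parse_point).cache()

w = np.zeros(D)  # weight vector, updated on the driver each iteration
for i in range(ITERATIONS):
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(add)
    w -= gradient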

SLIDE 21

On-Disk Performance

Time to sort 100 TB (source: Daytona GraySort benchmark, sortbenchmark.org)

  • 2013 record: Hadoop, 2100 machines, 72 minutes
  • 2014 record: Spark, 207 machines, 23 minutes

SLIDE 22

Libraries Built on Spark

[Diagram: libraries built on the Spark core engine]

  • Spark Streaming (real-time)
  • Spark SQL (structured data)
  • MLlib (machine learning)
  • GraphX (graph)

SLIDE 23

Combining Processing Types

// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
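
A runnable sketch of the first two steps in 1.x-era PySpark (the registered "tweets" table is an assumption, and the streaming step is omitted because the Twitter source required an external connector):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="CombinedSketch")  # hypothetical app name
ctx = SQLContext(sc)

# Load data using SQL (assumes a table named "tweets" is registered).
points = (ctx.sql("SELECT latitude, longitude FROM tweets")
             .rdd.map(lambda row: [row.latitude, row.longitude]))

# Train a machine learning model on the SQL output.
model = KMeans.train(points, 10)
print(model.clusterCenters)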

SLIDE 24

Combining Processing Types

Separate systems:

HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → . . .

Spark:

HDFS read → ETL → train → query → HDFS write

SLIDE 25

Performance vs Specialized Systems

[Chart: SQL: response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem)]

[Chart: ML: response time (min) for Mahout, GraphLab, Spark]

[Chart: Streaming: throughput (MB/s/node) for Storm, Spark]

SLIDE 26

Some Recent Additions

DataFrame API (similar to R and Pandas)

  • Easy programmatic way to work with structured data

R interface (SparkR)

Machine learning pipelines (like scikit-learn)
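
A small sketch of the DataFrame API in PySpark (the JSON path and column names are illustrative assumptions):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="DataFrameSketch")  # hypothetical app name
ctx = SQLContext(sc)

# Load structured data; Spark infers the schema from the JSON.
df = ctx.read.json("hdfs://.../people.json")  # hypothetical path

# R/Pandas-like operations, compiled and optimized by Spark's query planner.
df.filter(df.age > 21).groupBy("country").count().show()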

SLIDE 27

Overview

  • Why a unified engine?
  • Spark programming model
  • Built-in libraries
  • Applications

SLIDE 28

Spark Community

Over 1000 deployments, clusters up to 8000 nodes

Many talks online at spark-summit.org

SLIDE 29

Top Applications

  • Business Intelligence: 68%
  • Data Warehousing: 52%
  • Recommendation: 44%
  • Log Processing: 40%
  • User-Facing Services: 36%
  • Fraud Detection / Security: 29%

SLIDE 30

Spark Components Used

  • Spark SQL: 69%
  • DataFrames: 62%
  • Spark Streaming: 58%
  • MLlib + GraphX: 58%

75% of users use more than one component

SLIDE 31

Learn More

  • Get started on your laptop: spark.apache.org
  • Resources and MOOCs: sparkhub.databricks.com
  • Spark Summit: spark-summit.org

SLIDE 32

Thank You