Unified Big Data Processing with Apache Spark



SLIDE 1–2

Unified Big Data Processing with Apache Spark

Matei Zaharia @matei_zaharia

SLIDE 3

What is Apache Spark?

Fast & general engine for big data processing

Generalizes the MapReduce model to support more types of processing

Most active open source project in big data

SLIDE 4

About Databricks

Founded by the creators of Spark in 2013. Continues to drive open source Spark development and offers a cloud service (Databricks Cloud). Partners with Cloudera, MapR, Hortonworks, and Datastax to support Spark.

SLIDE 5

Spark Community

[Bar charts, activity in past 6 months: commits (0–2000) and lines of code changed (0–350,000) for MapReduce, YARN, HDFS, Storm, and Spark]

SLIDE 6

Community Growth

[Chart: contributors per month to Spark, 2010–2014]

2-3x more activity than Hadoop, Storm, MongoDB, NumPy, D3, Julia, …

SLIDE 7

Overview

> Why a unified engine?
> Spark execution model
> Why was Spark so general?
> What’s next

SLIDE 8

History: Cluster Programming Models

[Timeline of cluster programming models, starting in 2004]

SLIDE 9

MapReduce

A general engine for batch processing

SLIDE 10

Beyond MapReduce

MapReduce was great for batch processing, but users quickly needed to do more:

> More complex, multi-pass algorithms
> More interactive ad-hoc queries
> More real-time stream processing

Result: many specialized systems for these workloads

SLIDE 11

Big Data Systems Today

[Diagram: general batch processing (MapReduce) alongside specialized systems for new workloads (Pregel, Giraph, Presto, Storm, Dremel, Drill, Impala, S4, …)]

SLIDE 12

Problems with Specialized Systems

More systems to manage, tune & deploy

Can’t combine processing types in one application
> Even though many pipelines need to do this!
> E.g. load data with SQL, then run machine learning

In many pipelines, data exchange between engines is the dominant cost!

SLIDE 13

Big Data Systems Today

[Diagram: the same landscape, with a unified engine proposed in place of the specialized systems (Pregel, Giraph, Presto, Storm, Dremel, Drill, Impala, S4, …)]

SLIDE 14

Overview

> Why a unified engine?
> Spark execution model
> Why was Spark so general?
> What’s next

SLIDE 15

Background

Recall 3 workloads were issues for MapReduce:

> More complex, multi-pass algorithms
> More interactive ad-hoc queries
> More real-time stream processing

While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing

SLIDE 16

Data Sharing in MapReduce

[Diagram: an iterative algorithm and a series of ad-hoc queries in MapReduce; every iteration and every query reads its input from HDFS and writes its result back, with an HDFS read and write between each step]

Slow due to data replication and disk I/O

SLIDE 17
What We’d Like

[Diagram: the same iterations and queries sharing data through distributed memory, with one-time input processing]

10-100× faster than network and disk

SLIDE 18

Spark Model

Resilient Distributed Datasets (RDDs)

> Collections of objects that can be stored in memory or disk across a cluster
> Built via parallel transformations (map, filter, …)
> Fault-tolerant without replication
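The RDD model can be sketched as a toy in plain Python (an illustration of the idea, not Spark's implementation): each dataset keeps the function that computes it plus a pointer to its parent, so it can be evaluated lazily, cached, or rebuilt from lineage after a failure.

```python
# Toy model of an RDD: a dataset that remembers its lineage
# (parent + transformation) instead of eagerly storing results.
class ToyRDD:
    def __init__(self, compute, parent=None):
        self._compute = compute      # function that produces this dataset
        self.parent = parent         # lineage pointer
        self._cache = None

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self.collect()], parent=self)

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)], parent=self)

    def cache(self):
        self._cache = self._compute()
        return self

    def collect(self):
        # Recompute from lineage when not cached: this is what makes
        # fault tolerance cheap, since no replication is needed.
        return self._cache if self._cache is not None else self._compute()

lines = ToyRDD.from_list(["ERROR\ta\tdisk", "INFO\tb\tok", "ERROR\tc\tnet"])
errors = lines.filter(lambda s: s.startswith("ERROR")).cache()
print(errors.collect())  # ['ERROR\ta\tdisk', 'ERROR\tc\tnet']
```

Dropping the cache (as a lost partition would) and calling `collect()` again recomputes the same result from the lineage chain.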

SLIDE 19

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()             # action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver ships tasks to three workers; each worker reads one block of the file, caches its partition of messages, and returns results]

Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

SLIDE 20

Fault Tolerance

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda pair: pair[1] > 10)   # (type, count) pairs; Python 3 lambdas can't unpack tuples

[Diagram: lineage graph: input file → map → reduce → filter]

RDDs track lineage info to rebuild lost data
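The `reduceByKey` step in the example above can be sketched in plain Python (a single-machine toy, ignoring the shuffle Spark performs across partitions):

```python
def reduce_by_key(pairs, f):
    """Fold all values that share a key with f, like Spark's reduceByKey
    (but on one machine, with no shuffle)."""
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

records = [("warn", 1), ("error", 1), ("warn", 1), ("warn", 1)]
counts = reduce_by_key(records, lambda x, y: x + y)
print(counts)  # {'warn': 3, 'error': 1}
```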


SLIDE 22

Example: Logistic Regression

[Chart: running time (s) vs. number of iterations (1–30) for Hadoop and Spark. Hadoop: 110 s per iteration; Spark: 80 s for the first iteration, ~1 s for later ones]
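Why iteration favors in-memory data shows up in a toy logistic regression (plain Python, with made-up 1-D data and learning rate, not MLlib): every pass re-reads the same training set, so keeping it in memory instead of on HDFS pays off on each of the dozens of iterations.

```python
import math

# Toy 1-D logistic regression by gradient descent. The key access
# pattern: `data` is scanned in full on every iteration, which is
# why caching it in memory (Spark) beats rereading HDFS (Hadoop).
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # (feature, label)

w = 0.0
for _ in range(100):                     # each loop = one full pass over data
    grad = sum((1.0 / (1.0 + math.exp(-w * x)) - y) * x for x, y in data)
    w -= 0.5 * grad                      # learning rate 0.5 (arbitrary)

predictions = [1.0 / (1.0 + math.exp(-w * x)) > 0.5 for x, _ in data]
print(predictions)
```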

SLIDE 23

Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(s -> s.contains("ERROR")).count();

SLIDE 24

How General Is It?

SLIDE 25

Libraries Built on Spark

[Diagram: Spark Core with libraries layered on top]
> Spark SQL (relational)
> Spark Streaming (real-time)
> MLlib (machine learning)
> GraphX (graph)


SLIDE 27

Spark SQL

Represents tables as RDDs
Tables = Schema + Data = SchemaRDD

From Hive:

c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()

From JSON (tweets.json):

{"text": "hi", "user": { "name": "matei", "id": 123 }}

c.jsonFile("tweets.json").registerTempTable("tweets")
c.sql("select text, user.name from tweets")
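The SchemaRDD idea (rows that carry a schema) can be illustrated with plain Python namedtuples: the field names play the role of the schema, and the list comprehension mirrors the `r.year > 2013` filter above. A toy sketch, not Spark SQL:

```python
from collections import namedtuple

# Schema = the field names; data = the rows themselves.
Row = namedtuple("Row", ["text", "year"])
rows = [Row("hi", 2012), Row("hello", 2014), Row("hey", 2015)]

# Equivalent of: rows.filter(lambda r: r.year > 2013).collect()
recent = [r for r in rows if r.year > 2013]
print([r.text for r in recent])  # ['hello', 'hey']
```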

SLIDE 28

Spark Streaming

[Diagram: input arriving continuously over time]

SLIDE 29

Spark Streaming

[Diagram: the input stream chopped into a series of RDDs over time]

Represents streams as a series of RDDs over time

val spammers = sc.sequenceFile("hdfs://spammers.seq")

sc.twitterStream(...)
  .filter(t => t.text.contains("QCon"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()
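The "series of RDDs" model can be sketched in plain Python: chop the event list into fixed-size micro-batches and apply the same filter to each batch. A toy stand-in for DStreams, where real batches are defined by a time interval rather than a count:

```python
def micro_batches(events, batch_size):
    """Split an event stream into fixed-size batches (each batch
    playing the role of one RDD in a DStream)."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

events = ["a#QCon", "b", "c#QCon", "d", "e#QCon", "f"]

# Run the same batch computation (a filter) on every micro-batch.
results = [[e for e in batch if "#QCon" in e]
           for batch in micro_batches(events, 2)]
print(results)  # [['a#QCon'], ['c#QCon'], ['e#QCon']]
```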


SLIDE 31

MLlib

Vectors, Matrices = RDD[Vector]
Iterative computation

points = sc.textFile("data.txt").map(parsePoint)
model = KMeans.train(points, 10)
model.predict(newPoint)
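The `KMeans.train` call above can be illustrated with a toy 1-D k-means in plain Python (k=2 and a handful of points; MLlib's version runs the same two steps over partitioned RDDs):

```python
def kmeans_1d(points, centers, iters=10):
    """Toy 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = kmeans_1d(points, [0.0, 5.0])
print(centers)  # two centers, near 1.0 and 9.0
```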

SLIDE 32

GraphX

Represents graphs as RDDs of edges and vertices


SLIDE 35

Combining Processing Types

// Load data using SQL
val points = ctx.sql(
  "select latitude, longitude from historic_tweets")

// Train a machine learning model
val model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

SLIDE 36

Composing Workloads

Separate systems:
[Diagram: ETL → HDFS write/read → train → HDFS write/read → query, with an HDFS round-trip between every stage]

Spark:
[Diagram: a single HDFS read, then ETL → train → query within Spark, and one HDFS write at the end]

SLIDE 37

Performance vs Specialized Systems

[Charts: SQL response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem); ML response time (min) for Mahout, GraphLab, Spark; streaming throughput (MB/s/node) for Storm and Spark]

SLIDE 38

On-Disk Performance: Petabyte Sort

Spark beat last year’s Sort Benchmark winner, Hadoop, by 3× using 10× fewer machines

             2013 Record (Hadoop)   Spark 100 TB   Spark 1 PB
Data Size    102.5 TB               100 TB         1000 TB
Time         72 min                 23 min         234 min
Nodes        2100                   206            190
Cores        50400                  6592           6080
Rate/Node    0.67 GB/min            20.7 GB/min    22.5 GB/min

tinyurl.com/spark-sort

SLIDE 39

Overview

> Why a unified engine?
> Spark execution model
> Why was Spark so general?
> What’s next

SLIDE 40

Why was Spark so General?

In a world of growing data complexity, understanding this can help us design new tools and pipelines. Two perspectives:

> Expressiveness perspective
> Systems perspective

SLIDE 41

1. Expressiveness Perspective

Spark ≈ MapReduce + fast data sharing

SLIDE 42

1. Expressiveness Perspective

[Diagram: one MapReduce step = local computation followed by all-to-all communication]

> How to share data quickly across steps? Spark: RDDs
> How low is this latency? Spark: ~100 ms

MapReduce can emulate any distributed system!

SLIDE 43

2. Systems Perspective

Main bottlenecks in clusters are the network and I/O. Any system that lets apps control these resources can match the speed of specialized ones. In Spark:

> Users control data partitioning & caching
> We implement the data structures and algorithms of specialized systems within Spark records
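The partitioning point can be sketched in plain Python (a toy hash partitioner, nothing like Spark's JVM implementation; integer keys keep Python's `hash` deterministic). Two datasets partitioned by the same function put equal keys at the same partition index, which is what lets an engine join them without moving data across nodes:

```python
def hash_partition(records, n):
    """Assign each (key, value) record to one of n partitions by key hash."""
    parts = [[] for _ in range(n)]
    for k, v in records:
        parts[hash(k) % n].append((k, v))
    return parts

# Integer user IDs as keys (hash(int) is stable across runs).
users = [(0, "alice"), (1, "bob"), (2, "carol")]
orders = [(0, "book"), (2, "pen"), (1, "mug")]

# Co-partition both datasets: matching keys land in the same
# partition index, so a join needs no cross-partition shuffle.
u_parts = hash_partition(users, 2)
o_parts = hash_partition(orders, 2)
print(u_parts[0])  # [(0, 'alice'), (2, 'carol')]
print(o_parts[0])  # [(0, 'book'), (2, 'pen')]
```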

SLIDE 44

Examples

Spark SQL

> A SchemaRDD holds records for each chunk of data (multiple rows), with columnar compression

GraphX

> GraphX represents graphs as an RDD of HashMaps so that it can join quickly against each partition
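The columnar-compression point above can be illustrated with a toy run-length encoder in plain Python (not Spark SQL's actual format): storing a table column-by-column puts repeated values next to each other, which compresses well.

```python
def rle_column(values):
    """Run-length encode one column as [value, run_count] pairs.
    Columnar layouts compress well because values within a single
    column tend to repeat."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

years = [2013, 2013, 2013, 2014, 2014]  # one column of a table
print(rle_column(years))  # [[2013, 3], [2014, 2]]
```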

SLIDE 45

Result

Spark can leverage most of the latest innovations in databases, graph processing, machine learning, and more. Users get a single API that composes very efficiently.

More info: tinyurl.com/matei-thesis

SLIDE 46

Overview

> Why a unified engine?
> Spark execution model
> Why was Spark so general?
> What’s next

SLIDE 47

What’s Next for Spark

While Spark has been around since 2009, many pieces are just beginning:

> 300 contributors, 2 whole libraries new this year
> Big features in the works

SLIDE 48

Spark 1.2 (Coming in Dec)

New machine learning pipelines API

> Featurization & parameter search, similar to scikit-learn

Python API for Spark Streaming

Spark SQL pluggable data sources

> Hive, JSON, Parquet, Cassandra, ORC, …

Scala 2.11 support

SLIDE 49

Beyond Hadoop

[Diagram: a unified API spanning workloads (batch, interactive, streaming), storage systems (Hadoop, Cassandra, …), and environments (Mesos, public clouds, "your application here")]

Unified API across workloads, storage systems and environments

SLIDE 50

Learn More

Downloads and tutorials: spark.apache.org
Training: databricks.com/training (free videos)
Databricks Cloud: databricks.com/cloud

SLIDE 51

www.spark-summit.org

SLIDE 52