Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop (PowerPoint PPT Presentation)

SLIDE 1

Slides adapted from Matei Zaharia (MIT) and Oliver Vagner (TGI Fridays)

Spark & Spark SQL

High-Speed In-Memory Analytics over Hadoop and Hive Data

CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech

Instructor: Duen Horng (Polo) Chau

SLIDE 2

What is Spark?

Not a modified version of Hadoop; a separate, fast, MapReduce-like engine

» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop

Compatible with Hadoop’s storage APIs

» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

http://spark.apache.org

SLIDE 3

What is Spark SQL?

(Formerly called Shark)

Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x

SLIDE 4

Project History [latest: v1.1]

Spark project started in 2009 at UC Berkeley AMP lab, open sourced in 2010

Became Apache Top-Level Project in Feb 2014
Shark/Spark SQL started summer 2011
Built by 250+ developers and people from 50 companies
Scales to 1000+ nodes in production
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …


http://en.wikipedia.org/wiki/Apache_Spark

SLIDES 5-6

Why a New Programming Model?

MapReduce greatly simplified big data analysis, but as soon as it got popular, users wanted more:

» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries

These require faster data sharing across parallel jobs

SLIDE 7

Is MapReduce dead?

Up for debate… as of 10/7/2014

http://www.reddit.com/r/compsci/comments/296aqr/on_the_death_of_mapreduce_at_google/

http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/


SLIDES 8-9

Data Sharing in MapReduce

[Diagram: an iterative job reads Input, then each of iter. 1, iter. 2, … performs an HDFS read and an HDFS write; for interactive use, each of query 1, query 2, query 3 re-reads Input from HDFS to produce result 1, result 2, result 3]

Slow due to replication, serialization, and disk IO

SLIDES 10-11

Data Sharing in Spark

[Diagram: one-time processing loads Input into distributed memory; iter. 1, iter. 2, … and query 1, query 2, query 3 then share data through memory instead of HDFS]

10-100× faster than network and disk

SLIDE 12

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)

» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure

Interface

» Clean language-integrated API in Scala
» Can be used interactively from the Scala or Python console
» Supported languages: Java, Scala, Python, R
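A minimal sketch of the model in action, as one might type it into the Scala Spark shell (where sc, a SparkContext, is already provided; the data here is made up):

// An RDD: a distributed collection built from a local range
val nums = sc.parallelize(1 to 100)

// Transformations are lazy; cache() marks the result to be kept in memory
val evenSquares = nums.map(n => n * n).filter(_ % 2 == 0).cache()

// Actions trigger the computation; the second action reuses the cached data
println(evenSquares.count())
println(evenSquares.take(5).mkString(", "))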

SLIDE 13

http://www.scala-lang.org/old/faq/4
Functional programming in D3: http://sleptons.blogspot.com/2015/01/functional-programming-d3js-good-example.html
Scala vs Java 8: http://kukuruku.co/hub/scala/java-8-vs-scala-the-difference-in-approaches-and-mutual-innovations

SLIDES 14-32

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

[Diagram: a Driver program coordinates three Workers; each Worker reads one of Block 1 / Block 2 / Block 3 of the input from HDFS and keeps its partition of cachedMsgs in Cache 1 / Cache 2 / Cache 3; the Driver sends tasks to the Workers and collects results]

cachedMsgs.filter(_.contains("foo")).count      // action
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

http://www.slideshare.net/normation/scala-dreaded

SLIDE 33

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data

E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

[Lineage graph: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
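One way to inspect lineage directly is RDD.toDebugString, which prints the chain of parent RDDs Spark would replay to rebuild lost partitions. A quick Spark shell sketch (the path is a placeholder, as on the slide):

val messages = sc.textFile("hdfs://...")      // placeholder path
  .filter(_.contains("error"))
  .map(_.split('\t')(2))

// Prints the lineage from the mapped and filtered RDDs back to the
// HadoopRDD, which is what Spark recomputes from after a failure
println(messages.toDebugString)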


SLIDES 34-37

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once
var w = Vector.random(D)                                // initial parameter vector
for (i <- 1 to ITERATIONS) {                            // repeated MapReduce steps
  val gradient = data.map(p =>                          // to do gradient descent
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
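The slide's code relies on helpers (readPoint, Vector.random) from the original Spark examples. Here is a self-contained sketch of the same algorithm using plain Scala arrays; the input path, line format, D, and ITERATIONS are all assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp

object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double)

  val D = 10           // assumed feature dimension
  val ITERATIONS = 20  // assumed iteration count

  // Assumed line format: label followed by D features, space-separated
  def readPoint(line: String): Point = {
    val t = line.split(' ').map(_.toDouble)
    Point(t.tail, t.head)
  }

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lr-sketch").setMaster("local[*]"))

    // Parse the points once and keep them cached in memory
    val data = sc.textFile("/tmp/points.txt").map(readPoint).cache() // hypothetical path

    var w = Array.fill(D)(math.random) // initial parameter vector
    for (_ <- 1 to ITERATIONS) {
      // One MapReduce-style pass per iteration over the cached RDD
      val gradient = data.map { p =>
        val s = (1 / (1 + exp(-p.y * dot(w, p.x))) - 1) * p.y
        p.x.map(_ * s)
      }.reduce((g1, g2) => g1.zip(g2).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (u, g) => u - g } // w -= gradient
    }
    println("Final w: " + w.mkString(", "))
  }
}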

SLIDE 38

Logistic Regression Performance

[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30), Hadoop vs Spark]

Hadoop: 127 s / iteration
Spark: 174 s for the first iteration, 6 s for further iterations

SLIDE 39

Supported Operators

map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
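A few of these operators in combination (a Spark shell sketch with made-up data):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _).collect()                     // Array((a,4), (b,2))
pairs.groupByKey().mapValues(_.sum).collect()          // same result, larger shuffle
pairs.join(sc.parallelize(Seq(("a", "x")))).collect()  // Array((a,(1,x)), (a,(3,x)))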

SLIDE 40

Spark Users

SLIDE 41

Spark SQL: Hive on Spark

SLIDE 42

Motivation

Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?

SLIDE 43

Hive Architecture

[Diagram: Client (CLI, JDBC) submits to the Driver (SQL Parser → Query Optimizer → Physical Plan → Execution), which runs MapReduce jobs; a Metastore and HDFS sit alongside]

SLIDE 44

Spark SQL Architecture

[Diagram: same stack, but Execution runs on Spark instead of MapReduce, and the Driver gains a Cache Manager alongside the Query Optimizer]

[Engle et al, SIGMOD 2012]


SLIDES 45-46

Efficient In-Memory Storage

Simply caching Hive records as Java objects is inefficient due to high per-object overhead

Instead, Spark SQL employs column-oriented storage using arrays of primitive types

Row Storage:            Column Storage:
1 john  4.1             1    2    3
2 mike  3.5             john mike sally
3 sally 6.4             4.1  3.5  6.4

Benefit: similarly compact size to serialized data, but >5x faster to access
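The idea in miniature (a sketch in plain Scala, not Spark SQL's actual classes):

// Row storage: one object per record; numeric fields end up boxed in each tuple
val rows = Array((1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4))

// Column storage: one array per field; ints and doubles stay primitive and compact
val ids    = Array(1, 2, 3)
val names  = Array("john", "mike", "sally")
val scores = Array(4.1, 3.5, 6.4)

// Scanning one column walks a contiguous primitive array
// instead of dereferencing (and unboxing) every record object
val total = scores.sum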

SLIDE 47

Using Spark SQL

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs

» A few esoteric features are not yet supported

Can also call from Scala to mix with Spark
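Mixing the two from Scala might look like this (a sketch against the Spark 1.x HiveContext API; the table and column names are invented):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)   // sc: an existing SparkContext

// Standard HiveQL, including a CREATE TABLE ... AS SELECT like the one above
hc.sql("CREATE TABLE mydata_cached AS SELECT * FROM mydata WHERE year = 2014")

// Query results behave like RDDs of rows, so normal Spark operations apply
val names = hc.sql("SELECT name FROM mydata_cached")
names.map(row => row.getString(0)).take(10).foreach(println)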

SLIDE 48

Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

SLIDE 49

Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

SLIDE 50

What’s Next?

Recall that Spark’s model was motivated by two emerging uses (interactive and multi-stage apps)

Another emerging use case that needs fast data sharing is stream processing:

» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.


SLIDES 51-53

Streaming Spark

Extends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermixes seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Diagram: micro-batches at T=1, T=2, … flow through map and then reduceByWindow]

Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency

[Zaharia et al, HotCloud 2012]

SLIDE 54

Spark Streaming

Create and operate on RDDs from live data streams at set intervals

Data is divided into batches for processing

Streams may be combined as a part of processing or analyzed with higher-level transforms
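A minimal DStream sketch against the Spark 1.x streaming API (the socket source, host, and port are illustrative; local mode needs at least two cores, one for the receiver):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))   // 1-second micro-batches

// Each batch interval yields an RDD of the lines received on the socket
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(5))       // sliding 5-second window

counts.print()          // show a few results of each batch
ssc.start()
ssc.awaitTermination()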

SLIDE 55

Behavior with Not Enough RAM

[Chart: iteration time (s) vs % of working set in memory]

Cache disabled: 68.8 s
25% in memory:  58.1 s
50% in memory:  40.7 s
75% in memory:  29.7 s
Fully cached:   11.5 s

SLIDE 56

SPARK PLATFORM

[Diagram: the Spark platform stack]

Libraries: Spark SQL / Shark, Spark Streaming, MLlib, GraphX
Execution: RDDs with Scala/Python/Java APIs
Resource Management: YARN / Spark Standalone / Mesos
Data Storage: Standard FS / HDFS / CFS / S3

SLIDE 57

SLIDE 58

MLlib

Scalable machine learning library

Interoperates with NumPy

Available algorithms in 1.0 (a usage sketch follows the list):

» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent
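For instance, training K-means via the 1.x MLlib API might look like this (a Spark shell sketch; the path and parameters are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumed input: one space-separated feature vector per line (hypothetical path)
val data = sc.textFile("/tmp/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 3, 20)   // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)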

SLIDE 59

GraphX

Parallel graph processing

Extends RDD -> Resilient Distributed Property Graph

» Directed multigraph with properties attached to each vertex and edge

Limited algorithms:

» PageRank
» Connected Components
» Triangle Counts

Alpha component
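A small GraphX sketch (Spark shell; the toy graph is made up):

import org.apache.spark.graphx.{Edge, Graph}

// A directed multigraph: vertex property = name, edge property = relationship
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(vertices, edges)

// PageRank, iterated until scores move less than the tolerance
graph.pageRank(0.001).vertices.collect().foreach(println)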

SLIDE 60

Commercial Support

Databricks

» Not to be confused with DataStax
» Founded by members of the AMPLab
» Offering:

  • Certification
  • Training
  • Support
  • Databricks Cloud
