

SLIDE 1

event.cwi.nl/lsde

Large-Scale Data Engineering

Spark and MLLIB

SLIDE 2

OVERVIEW OF SPARK

SLIDE 3

What is Spark?

  • Fast and expressive cluster computing system interoperable with Apache Hadoop
  • Improves efficiency through:
    – In-memory computing primitives
    – General computation graphs
  • Improves usability through:
    – Rich APIs in Scala, Java, Python
    – Interactive shell

Up to 100× faster (2-10× on disk); often 5× less code

SLIDE 4

The Spark Stack

  • Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
    – Spark Streaming (real-time)
    – GraphX (graph)
    – Spark SQL
    – MLlib (machine learning)
    all running on the Spark core engine

More details: amplab.berkeley.edu

SLIDE 5

Why a New Programming Model?

  • MapReduce greatly simplified big data analysis
  • But as soon as it got popular, users wanted more:

    – More complex, multi-pass analytics (e.g. ML, graph)
    – More interactive ad-hoc queries
    – More real-time stream processing

  • All 3 need faster data sharing across parallel jobs
SLIDE 6

Data Sharing in MapReduce

[Diagram: in MapReduce, each iteration of an iterative job, and each ad-hoc query, reads its input from HDFS and writes its result back to HDFS]

Slow due to replication, serialization, and disk I/O

SLIDE 7

Data Sharing in Spark

[Diagram: a one-time HDFS read loads the input into distributed memory; subsequent iterations and queries share the data in memory]

~10× faster than network and disk

SLIDE 8

Spark Programming Model

  • Key idea: resilient distributed datasets (RDDs)

    – Distributed collections of objects that can be cached in memory across the cluster
    – Manipulated through parallel operators
    – Automatically recomputed on failure

  • Programming interface

    – Functional APIs in Scala, Java, Python
    – Interactive use from Scala shell

SLIDE 9

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

[Diagram: the driver ships tasks to workers; lines is the base RDD, errors and messages are transformed RDDs]

SLIDE 10

Lambda Functions

A lambda is an implicit (anonymous) function definition, borrowed from functional programming. The filter below is equivalent to passing a named predicate:

errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])

bool detect_error(string x) { return x.startswith("ERROR"); }
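The equivalence can be checked with plain Python builtins — a minimal sketch over a few hypothetical log lines (no Spark needed, since `filter()` accepts any function, named or anonymous):

```python
# A lambda is just an anonymous function: both forms below behave identically.
def detect_error(x):
    """Named-function version of the filter predicate."""
    return x.startswith("ERROR")

lines = ["ERROR\t10:02\tdisk full", "INFO\t10:03\tok", "ERROR\t10:04\ttimeout"]

# Named function passed to filter()...
errors_named = list(filter(detect_error, lines))
# ...versus the inline lambda form used on the slides.
errors_lambda = list(filter(lambda x: x.startswith("ERROR"), lines))

assert errors_named == errors_lambda
print(len(errors_lambda))  # → 2
```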

SLIDE 11

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()
messages.filter(lambda x: "bar" in x).count()
. . .

[Diagram: the driver ships tasks to workers holding blocks 1-3; each worker caches its partition of messages and returns results. lines is the base RDD, errors and messages are transformed RDDs, count() is an action]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
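The transformation-then-action pipeline can be mimicked locally with list comprehensions — a sketch over a few hypothetical log lines (plain Python lists, not RDDs, so there is no laziness or caching, only the same data flow):

```python
# Hypothetical sample log lines standing in for the HDFS file.
log = [
    "INFO\t09:00\tstartup",
    "ERROR\t09:01\tfoo failed",
    "ERROR\t09:02\tbar timeout",
    "WARN\t09:03\tslow response",
]

# lines -> errors -> messages, mirroring the RDD transformations.
errors = [x for x in log if x.startswith("ERROR")]
messages = [x.split("\t")[2] for x in errors]  # this is what Spark would cache

# Actions: count matches, as in messages.filter(...).count()
foo_count = sum(1 for m in messages if "foo" in m)
bar_count = sum(1 for m in messages if "bar" in m)
print(foo_count, bar_count)  # → 1 1
```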

SLIDE 12

Fault Tolerance

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda (type, count): count > 10)

[Diagram: lineage graph — input file → map → reduce → filter]

RDDs track lineage info to rebuild lost data
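A toy illustration of lineage-based recovery, using a hypothetical `LineageRDD` class (not Spark's API): each dataset stores *how* to recompute itself from its parent rather than the data itself, so a lost partition can simply be rebuilt by replaying the lineage.

```python
class LineageRDD:
    """Toy dataset that stores its lineage (parent + transformation)
    instead of its data, so lost results can be recomputed on demand."""
    def __init__(self, compute):
        self._compute = compute      # closure describing how to (re)build the data
        self._cached = None

    def map(self, f):
        return LineageRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return LineageRDD(lambda: [x for x in self.collect() if pred(x)])

    def collect(self):
        if self._cached is None:            # cache miss (or lost partition):
            self._cached = self._compute()  # replay lineage to rebuild
        return self._cached

source = LineageRDD(lambda: [1, 2, 3, 4, 5])
squares = source.map(lambda x: x * x).filter(lambda x: x > 5)

print(squares.collect())   # → [9, 16, 25]
squares._cached = None     # simulate losing the cached result
print(squares.collect())   # → [9, 16, 25], rebuilt from lineage
```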


SLIDE 14

Example: Logistic Regression

[Chart: running time (s) vs number of iterations (1-30), Hadoop vs Spark]

Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
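The workload behind this chart can be sketched as gradient-descent logistic regression in plain Python on a tiny hypothetical dataset. The point is the access pattern: every iteration rescans the full dataset, which is exactly the scan Spark serves from memory after the first pass.

```python
import math

# Tiny 1-D dataset: (feature, label) pairs; hypothetical numbers.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w = 0.0    # single weight
lr = 0.5   # learning rate
for _ in range(30):                 # each iteration scans the full dataset --
    grad = 0.0                      # the scan Spark keeps cheap by caching
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-w * x))   # sigmoid prediction
        grad += (p - y) * x                  # logistic-loss gradient
    w -= lr * grad / len(data)

# A positive weight separates the two classes correctly.
assert w > 0
print(1.0 / (1.0 + math.exp(-w * 2.0)) > 0.5)  # → True
```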

SLIDE 15

Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

SLIDE 16

Supported Operators

  • map
  • filter
  • groupBy
  • sort
  • union
  • join
  • leftOuterJoin
  • rightOuterJoin
  • reduce
  • count
  • fold
  • reduceByKey
  • groupByKey
  • cogroup
  • cross
  • zip

  • sample
  • take
  • first
  • partitionBy
  • mapWith
  • pipe
  • save
  • ...
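Most of these operators have direct local analogues in Python's builtins, which is a handy way to reason about them — a sketch on a small list, with `reduceByKey` approximated via `sorted` + `groupby` (a simplification of Spark's shuffle-based implementation):

```python
from functools import reduce
from itertools import groupby

nums = [3, 1, 4, 1, 5, 9, 2, 6]

# map / filter / reduce / count / sort have direct local analogues:
doubled = list(map(lambda x: x * 2, nums))
evens   = list(filter(lambda x: x % 2 == 0, nums))
total   = reduce(lambda x, y: x + y, nums)   # like rdd.reduce(lambda x, y: x + y)
count   = len(nums)                          # like rdd.count()
ordered = sorted(nums)                       # like rdd.sort()

# reduceByKey on (key, value) pairs: group by key, then sum each group.
pairs = [("a", 1), ("b", 2), ("a", 3)]
by_key = {k: sum(v for _, v in g)
          for k, g in groupby(sorted(pairs), key=lambda kv: kv[0])}

print(total, count, by_key)  # → 31 8 {'a': 4, 'b': 2}
```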

SLIDE 17

Software Components

  • Spark client is a library in the user program (1 instance per app)
  • Runs tasks locally or on a cluster
    – Mesos, YARN, standalone mode
  • Accesses storage systems via the Hadoop InputFormat API
    – Can use HBase, HDFS, S3, …

[Diagram: your application holds a SparkContext with local threads; a cluster manager launches Spark executors on workers, which read from HDFS or other storage]

SLIDE 18

Task Scheduler

  • General task graphs
  • Automatically pipelines functions
  • Data locality aware
  • Partitioning aware to avoid shuffles

[Diagram: DAG of RDDs A-F built from map, filter, join, and groupBy, split into Stages 1-3; cached partitions let stages be skipped]

SLIDE 19

Spark SQL

  • Columnar SQL analytics engine for Spark

    – Supports both SQL and complex analytics
    – Up to 100× faster than Apache Hive

  • Compatible with Apache Hive

    – HiveQL, UDF/UDAF, SerDes, scripts
    – Runs on existing Hive warehouses

  • In use at Yahoo! for fast in-memory OLAP
SLIDE 20

Hive Architecture

[Diagram: client (CLI, JDBC) → driver (SQL parser, query optimizer, physical plan, execution) → MapReduce, over the Hive catalog and HDFS]

SLIDE 21

Spark SQL Architecture

[Diagram: same client and driver as Hive (SQL parser, cache manager, query optimizer, physical plan, execution), but executing on Spark instead of MapReduce, over the Hive catalog and HDFS]

[Engle et al, SIGMOD 2012]

SLIDE 22

What Makes it Faster?

  • Lower-latency engine (Spark OK with 0.5s jobs)
  • Support for general DAGs
  • Column-oriented storage and compression
  • New optimizations (e.g. map pruning)
SLIDE 23

Other Spark Stack Projects

  • Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

sc.twitterStream(...)
  .flatMap(_.getText.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)
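The windowed word count above can be mimicked locally by bucketing records into fixed-size time windows — a sketch over hypothetical (timestamp, text) tweets in plain Python, not the actual Spark Streaming API:

```python
from collections import Counter

# Hypothetical (timestamp_seconds, text) tweets standing in for the stream.
tweets = [
    (0, "spark is fast"),
    (2, "spark streaming"),
    (6, "fast and fault tolerant"),
]

# Bucket tweets into 5-second windows, then word-count each window,
# mirroring flatMap -> map -> reduceByWindow("5s", _ + _).
windows = {}
for ts, text in tweets:
    win = ts // 5
    windows.setdefault(win, Counter()).update(text.split(" "))

print(windows[0]["spark"])  # → 2  ("spark" appears twice in the first window)
```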

  • MLlib: Library of high-quality machine learning algorithms (out since 0.8)
SLIDE 24

Performance

[Charts: SQL — response time (s) for Impala (disk), Impala (mem), Redshift, Spark SQL (disk), Spark SQL (mem); Streaming — throughput (MB/s/node) for Storm vs Spark; Graph — response time (min) for Hadoop, Giraph, GraphX]

SLIDE 25

What it Means for Users

  • Separate frameworks: each step (ETL, train, query) reads its input from HDFS and writes its result back to HDFS
  • Spark: one HDFS read, then ETL, train, and query share data in memory

SLIDE 26

Conclusion

  • Big data analytics is evolving to include:

    – More complex analytics (e.g. machine learning)
    – More interactive ad-hoc queries
    – More real-time stream processing

  • Spark is a fast platform that unifies these apps
  • More info: spark-project.org
SLIDE 27

SPARK MLLIB

SLIDE 28

What is MLLIB?

MLlib is a Spark subproject providing machine learning primitives:

  • initial contribution from AMPLab, UC Berkeley
  • shipped with Spark since version 0.8
SLIDE 29

What is MLLIB?

Algorithms:

  • classification: logistic regression, linear support vector machine (SVM), naive Bayes
  • regression: generalized linear models (GLM)
  • collaborative filtering: alternating least squares (ALS)
  • clustering: k-means
  • decomposition: singular value decomposition (SVD), principal component analysis (PCA)

SLIDE 30

Collaborative Filtering

SLIDE 31

Alternating Least Squares (ALS)

SLIDE 32

Collaborative Filtering in Spark MLLIB

trainset = sc.textFile("s3n://bads-music-dataset/train_*.gz") \
    .map(lambda l: l.split('\t')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))

model = ALS.train(trainset, rank=10, iterations=10)  # train the model

# load the testing set
testset = sc.textFile("s3n://bads-music-dataset/test_*.gz") \
    .map(lambda l: l.split('\t')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))

# apply the model to the testing set (only first two columns) to predict
predictions = model.predictAll(testset.map(lambda p: (p[0], p[1]))) \
    .map(lambda r: ((r[0], r[1]), r[2]))
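ALS itself alternates closed-form least-squares solves: fix the item factors and solve for the user factors, then vice versa. A rank-1, pure-Python sketch on a tiny hypothetical ratings matrix — MLlib's implementation is distributed, regularized, and higher-rank; this shows only the core alternation:

```python
# Rank-1 ALS on a tiny dense ratings matrix R ~ u * v^T (hypothetical data).
R = [[5.0, 3.0], [10.0, 6.0]]   # 2 users x 2 items (exactly rank 1 here)
u = [1.0, 1.0]                  # user factors
v = [1.0, 1.0]                  # item factors

for _ in range(20):
    # Fix v, solve each u[i] in closed form: u_i = sum_j R_ij v_j / sum_j v_j^2
    u = [sum(R[i][j] * v[j] for j in range(2)) / sum(x * x for x in v)
         for i in range(2)]
    # Fix u, solve each v[j] symmetrically.
    v = [sum(R[i][j] * u[i] for i in range(2)) / sum(x * x for x in u)
         for j in range(2)]

pred = u[0] * v[0]   # predicted rating of item 0 by user 0
print(round(pred, 1))  # → 5.0
```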

SLIDE 33

Spark MLLIB – ALS Performance

System   | Wall-clock time (seconds)
Matlab   | 15443
Mahout   | 4206
GraphLab | 291
MLlib    | 481

  • Dataset: Netflix data
  • Cluster: 9 machines.
  • MLlib is an order of magnitude faster than Mahout.
  • MLlib is within factor of 2 of GraphLab.
SLIDE 34

Spark Implementation of ALS

  • Workers load data
  • Models are instantiated at workers
  • At each iteration, models are shared via join between workers
  • Good scalability
  • Works on large datasets

[Diagram: master coordinating workers]

SLIDE 35

Spark SQL + MLLIB

SLIDE 36

MLLIB Pointers

  • Website: http://spark.apache.org
  • Tutorials: http://ampcamp.berkeley.edu
  • Spark Summit: http://spark-summit.org
  • Github: https://github.com/apache/spark
  • Mailing lists: user@spark.apache.org, dev@spark.apache.org