Apache Spark

Corso di Sistemi e Architetture per Big Data, A.A. 2019/2020
Valeria Cardellini
Macroarea di Ingegneria, Dipartimento di Ingegneria Civile e Ingegneria Informatica
Laurea Magistrale in Ingegneria Informatica



The reference Big Data stack

[Figure: the reference Big Data stack – Resource Management, Data Storage, Data Processing, High-level Interfaces, Support / Integration]


MapReduce: weaknesses and limitations

  • Programming model

– Hard to implement everything as a MapReduce program
– Multiple MapReduce steps needed even for simple operations
  • E.g., WordCount that also sorts words by their frequency

– Lack of control, structures and data types

  • No native support for iteration

– Each iteration writes/reads data from disk: overhead
– Need to design algorithms that minimize the number of iterations


MapReduce: weaknesses and limitations

  • Efficiency (recall HDFS)

– High communication cost: computation (map), communication (shuffle), computation (reduce)
– Frequent writing of output to disk
– Limited exploitation of main memory

  • Not feasible for real-time data stream processing

– A MapReduce job requires scanning the entire input before processing it


Alternative programming models

  • Based on directed acyclic graphs (DAGs)

– E.g., Spark, Spark Streaming, Storm, Flink

  • SQL-based

– E.g., Hive, Pig, Spark SQL, Vertica

  • NoSQL data stores

– We have already analyzed HBase, MongoDB, Cassandra, …

  • Based on Bulk Synchronous Parallel (BSP)


Alternative programming models: BSP

  • Bulk Synchronous Parallel (BSP)

– Developed by Leslie Valiant during the 1980s
– Considers communication actions en masse
– Suitable for graph analytics at massive scale and massive scientific computations (e.g., matrix, graph and network algorithms)


  • Examples: Google’s Pregel, Apache Giraph, Apache Hama
  • Giraph: open source counterpart to Pregel, developed at Facebook to analyze the users’ social graph

https://giraph.apache.org/


Apache Spark

  • Fast and general-purpose engine for Big Data processing

– Not a modified version of Hadoop
– Leading platform for large-scale SQL, batch processing, stream processing, and machine learning
– Unified analytics engine for large-scale data processing

  • In-memory data storage for fast iterative processing

– At least 10x faster than Hadoop

  • Suitable for general execution graphs and powerful optimizations
  • Compatible with Hadoop’s storage APIs

– Can read/write to any Hadoop-supported system, including HDFS and HBase


Spark milestones

  • Spark project started in 2009
  • Developed originally at UC Berkeley’s AMPLab by Matei Zaharia for his PhD thesis

  • Open sourced in 2010, Apache project from 2013
  • In 2014, Zaharia founded Databricks
  • Current version: 2.4.5
  • The most active open source project for Big Data processing (see the popularity trend below)


Spark popularity

  • Based on Stack Overflow Trends


Spark: why a new programming model?

  • MapReduce simplified Big Data analysis

– But it executes jobs in a simple but rigid structure

  • Step to process or transform data (map)
  • Step to synchronize (shuffle)
  • Step to combine results (reduce)
  • As soon as MapReduce got popular, users wanted:

– Iterative computations (e.g., iterative graph algorithms and machine learning algorithms, such as PageRank, stochastic gradient descent, K-means clustering)
– More interactive ad-hoc queries
– More efficiency
– Faster in-memory data sharing across parallel jobs (required by both iterative and interactive applications)


Data sharing in MapReduce

  • Slow due to replication, serialization, and disk I/O


Data sharing in Spark

  • Distributed in-memory: 10x-100x faster than disk and network


Spark vs Hadoop MapReduce

  • Underlying programming paradigm similar to MapReduce

– Basically “scatter-gather”: scatter data and computation on multiple cluster nodes, which process their data portions in parallel; then gather the final results

  • Spark offers a more general data model

– RDDs, DataSets, DataFrames

  • Spark offers a more general and developer-friendly programming model

– Map -> Transformations in Spark
– Reduce -> Actions in Spark

  • Storage agnostic

– Not only HDFS, but also Cassandra, S3, Parquet files, …


Spark stack


Spark core

  • Provides basic functionalities (including task scheduling, memory management, fault recovery, interacting with storage systems) used by other components

  • Provides a data abstraction called resilient distributed dataset (RDD), a collection of items distributed across many compute nodes that can be manipulated in parallel

– Spark Core provides many APIs for building and manipulating these collections

  • Written in Scala but APIs for Java, Python and R


Spark as unified engine

  • A number of integrated higher-level modules built on top of Spark

– Can be combined seamlessly in the same application

  • Spark SQL

– To work with structured data
– Allows querying data via SQL
– Supports many data sources (Hive tables, Parquet, JSON, …)
– Extends Spark RDD API

  • Spark Streaming

– To process live streams of data
– Extends Spark RDD API


Spark as unified engine

  • MLlib

– Scalable machine learning (ML) library
– Many distributed algorithms: feature extraction, classification, regression, clustering, recommendation, …


  • GraphX

– API for manipulating graphs and performing graph-parallel computations
– Also includes common graph algorithms (e.g., PageRank)
– Extends Spark RDD API

[Figures: PageRank performance (20 iterations, 3.7B edges); logistic regression performance]

Spark on top of cluster managers

  • Spark can exploit many cluster resource managers to execute its applications

  • Spark standalone mode

– Uses a simple FIFO scheduler included in Spark

  • Hadoop YARN
  • Mesos

– Mesos and Spark are both from AMPLab @ UC Berkeley

  • Kubernetes


Spark architecture


  • Master/worker architecture


Spark architecture


http://spark.apache.org/docs/latest/cluster-overview.html

  • Driver program that talks to cluster manager
  • Worker nodes in which executors run


Spark architecture

  • Each application consists of a driver program and executors on the cluster

– Driver program: process running main() function of the application and creating the SparkContext object

  • Each application gets its own executors, which are processes that stay up for the duration of the whole application and run tasks in multiple threads

– Isolation of concurrent applications

  • To run on a cluster, SparkContext connects to a cluster manager, which allocates resources

  • Once connected, Spark acquires executors on cluster nodes and sends the application code (e.g., jar) to executors

  • Finally, SparkContext sends tasks to the executors to run


Spark data flow


  • Iterative computation is particularly useful when working with machine learning algorithms




Resilient Distributed Datasets (RDDs)

  • RDDs are the key programming abstraction in Spark: a distributed memory abstraction

  • Immutable, partitioned and fault-tolerant collection of elements that can be manipulated in parallel

– Like a LinkedList<MyObjects>
– Stored in main memory across the cluster nodes

  • Each cluster node that is used to run an application contains at least one partition of the RDD(s) that is (are) defined in the application


RDDs: distributed and partitioned

  • Stored in main memory of the executors running in the worker nodes (when it is possible) or on node local disk (if not enough main memory)

  • Allow executing in parallel the code invoked on them

– Each executor of a worker node runs the specified code on its partition of the RDD
– Partition: atomic chunk of data (a logical division of data) and basic unit of parallelism
– Partitions of an RDD can be stored on different cluster nodes


RDDs: immutable and fault-tolerant

  • Immutable once constructed

– i.e., RDD content cannot be modified
– Create a new RDD based on an existing RDD

  • Automatically rebuilt on failure (without replication)

– Track lineage information so as to efficiently recompute missing or lost data due to node failures
– For each RDD, Spark knows how it has been constructed and can rebuild it if a failure occurs
– This information is represented by means of the RDD lineage DAG connecting input data and RDDs


Spark and RDDs

  • Spark manages the split of RDDs in partitions and allocates RDDs’ partitions to cluster nodes

  • Spark hides the complexities of fault tolerance

– RDDs are automatically rebuilt in case of failure using the RDD lineage DAG, which defines the logical execution plan
– The DAG can be visualized using the Web UI from the driver program (see also the sketch below)
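As an illustration, the lineage tracked for an RDD can also be printed in textual form with toDebugString; a minimal PySpark sketch, assuming a running shell with the sc variable (file path and operations are illustrative):

# Build an RDD through a few transformations; nothing is computed yet
lines = sc.textFile("hdfs://.../README.md")
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Print the lineage DAG that Spark would use to recompute lost partitions
# (in Python 3, toDebugString() returns a byte string)
print(counts.toDebugString().decode("utf-8"))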


RDD: API and suitability

  • RDD API

– Clean language-integrated API for Scala, Python, Java, and R
– Can be used interactively from the Scala console and the PySpark console
– Spark also offers higher-level APIs: DataFrames and Datasets

  • RDD suitability

– Best suited for batch applications that apply the same operation to all the elements in a dataset
– Less suited for applications that make asynchronous fine-grained updates to shared state, e.g., storage system for a web application


Operations in RDD API

  • Spark programs are written in terms of operations on RDDs
  • RDDs are created from external data or other RDDs

  • RDDs are created and manipulated through:

– Coarse-grained transformations, which define a new dataset based on previous ones

  • Map, filter, join, …

– Actions, which kick off a job to execute on a cluster

  • Count, collect, save, …


Spark programming model

  • Based on parallelizable operators

– Higher-order functions that execute user-defined functions in parallel

  • Data flow is composed of any number of data sources, operators, and data sinks by connecting their inputs and outputs

  • Job description based on a directed acyclic graph (DAG)


Higher-order functions

  • Higher-order functions: RDDs operators
  • Two types of RDD operators: transformations and actions

  • Transformations: lazy operations that create new RDDs

– Lazy: the new RDD representing the result of a computation is not immediately computed but is materialized on demand when an action is called

  • Actions: operations that return a value to the driver program after running a computation on the dataset, or write data to the external storage


Higher-order functions


  • Transformations and actions available on RDDs in Spark

  • Seq[T]: sequence of elements of type T


How to create RDD

  • RDD can be created by:

– Parallelizing existing collections of the hosting programming language (e.g., collections and lists of Scala, Java, Python, or R)

  • Number of partitions specified by user
  • In RDD API: parallelize

– From (large) files stored in HDFS or any other file system

  • One partition per HDFS block
  • In RDD API: textFile

– Transforming an existing RDD

  • Number of partitions depends on transformation type
  • In RDD API: transformation operations (map, filter, flatMap)


How to create RDDs

  • Turn an existing collection into an RDD

– sc is the Spark context variable
– Important parameter: number of partitions to cut the dataset into
– Spark will run one task for each partition of the cluster (typical setting: 2-4 partitions for each CPU in the cluster)
– Spark tries to set the number of partitions automatically
– You can also set it manually by passing it as a second parameter to parallelize, e.g., sc.parallelize(data, 10)

  • Load data from storage (local file system, HDFS, or S3)

Examples in Python:

lines = sc.parallelize(["pandas", "i like pandas"])
lines = sc.textFile("/path/to/README.md")
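A small sketch of how the number of partitions can be controlled and inspected (values and paths are illustrative):

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data, 10)     # explicitly cut the dataset into 10 partitions
print(rdd.getNumPartitions())      # 10

lines = sc.textFile("/path/to/README.md")
print(lines.getNumPartitions())    # for HDFS input, one partition per block by default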


RDD transformations: map and filter

  • map: takes as input a function which is applied to each element of the RDD and maps each input item to another item

  • filter: generates a new RDD by filtering the source dataset using the specified function

# transforming each element through a function
nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)  # [1, 4, 9, 16]
# selecting those elements for which the function returns true
even = squares.filter(lambda num: num % 2 == 0)  # [4, 16]


RDD transformations: flatMap

  • flatMap: takes as input a function which is applied to each element of the RDD; can map each input item to zero or more output items

# splitting input lines into words
lines = sc.parallelize(["hello world", "hi"])
words = lines.flatMap(lambda line: line.split(" "))  # ['hello', 'world', 'hi']
# mapping each element to zero or more others
ranges = nums.flatMap(lambda x: range(0, x, 1))  # [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]

Range function in Python: ordered sequence of integer values in [start, end) with non-zero step


RDD transformations: reduceByKey

  • reduceByKey: aggregates values with identical key using the specified function
  • Runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key

x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1),
                    ("b", 1), ("b", 1), ("b", 1), ("b", 1)], 3)
# Applying reduceByKey operation
y = x.reduceByKey(lambda accum, n: accum + n)  # [('b', 5), ('a', 3)]


RDD transformations: join

  • join: performs an equi-join on the keys of two RDDs
  • Only keys that are present in both RDDs are output
  • Join candidates are independently processed

users = sc.parallelize([(0, "Alex"), (1, "Bert"), (2, "Curt"), (3, "Don")])
hobbies = sc.parallelize([(0, "writing"), (0, "gym"), (1, "swimming")])
users.join(hobbies).collect()
# [(0, ('Alex', 'writing')), (0, ('Alex', 'gym')), (1, ('Bert', 'swimming'))]


Some RDD actions

  • collect: returns all the elements of the RDD as a list
  • take: returns an array with the first n elements in the RDD

  • count: returns the number of elements in the RDD

nums = sc.parallelize([1, 2, 3, 4])
nums.collect()  # [1, 2, 3, 4]
nums.take(3)    # [1, 2, 3]
nums.count()    # 4


Some RDD actions

  • reduce: aggregates the elements in the RDD using the specified function
  • saveAsTextFile: writes the elements of the RDD as a text file, either to the local file system or HDFS

sum = nums.reduce(lambda x, y: x + y)
nums.saveAsTextFile("hdfs://file.txt")


Lazy transformations

  • Transformations are lazy: they are not computed till an action requires a result to be returned to the driver program

  • This design enables Spark to perform operations more efficiently, as operations can be grouped together

– E.g., if there were multiple filter or map operations, Spark can fuse them into one pass (see the sketch below)
– E.g., if Spark knows that data is partitioned, it can avoid moving it over the network for groupBy
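A minimal PySpark sketch of this lazy behavior (the input path is illustrative); no data is read or processed until the final action:

# Transformations only build the lineage; no data is touched yet
lines = sc.textFile("hdfs://.../access.log")
errors = lines.filter(lambda l: "ERROR" in l)        # still lazy
pairs = errors.map(lambda l: (l.split(" ")[0], 1))   # still lazy

# Only this action triggers reading the input and running the fused filter and map
n_errors = pairs.count()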


First examples

  • Let us analyze some simple examples that use the Spark API

– WordCount – Pi estimation

  • See http://spark.apache.org/examples.html
  • For additional examples see those distributed with Spark, e.g.,

– In Java: https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples
– In Python: https://github.com/apache/spark/tree/master/examples/src/main/python


Example: WordCount in Scala


val textFile = sc.textFile("hdfs://...")
val words = textFile.flatMap(line => line.split(" "))
val ones = words.map(word => (word, 1))
val counts = ones.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")


Example: WordCount in Scala with chaining


  • Transformations and actions can be chained together

– We use a few transformations to build a dataset of (String, Int) pairs called counts and then save it to a file

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")


Example: WordCount in Python


text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
output = counts.collect()
counts.saveAsTextFile("hdfs://...")

Example: WordCount in Java 7 (and earlier)

JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");


Spark’s Java API allows creating tuples using the scala.Tuple2 class. Pair RDDs are RDDs containing key/value pairs.


Example: WordCount in Java 8

  • Example in Java 7: too verbose
  • Support for lambda expressions from Java 8

– Anonymous methods (methods without names) used to implement a method defined by a functional interface
– The new arrow operator -> divides the lambda expression into two parts

  • Left side: parameters required by the lambda expression
  • Right side: actions of the lambda expression


JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("hdfs://...");


Example: Pi estimation

  • How to estimate π? Let’s use the Monte Carlo method
  • By definition, π is the area of a circle with radius equal to 1

1. Pick a large number of points randomly inside the circumscribed unit square
   – A certain number of these points will end up inside the area described by the circle, while the remaining points will lie outside of it (but inside the square)
2. Count the fraction of points that end up inside the circle out of the total population of points randomly thrown at the circumscribed square


Example: Pi estimation

  • In formulas: a random point (x, y), with x and y uniform in [0, 1), falls inside the circle if x² + y² < 1, so π ≈ 4 · (points inside the circle) / (total points)
  • The more points generated, the greater the accuracy of the estimation

See the animation at https://academo.org/demos/estimating-pi-monte-carlo/


Example: Pi estimation in Python and Scala

import random

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(xrange(0, NUM_SAMPLES)) \
          .filter(inside).count()
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


val count = sc.parallelize(1 to NUM_SAMPLES).filter { _ =>
  val x = math.random
  val y = math.random
  x*x + y*y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")

Initializing Spark: SparkContext

  • First step in a Spark program: create a SparkContext object, which is the main entry point for Spark functionalities

– Represents the connection to the Spark cluster, can be used to create RDDs on that cluster
– Also available in the shell, in the variable called sc

  • Only one SparkContext may be active per JVM

– stop() the active SparkContext before creating a new one

  • SparkConf object: configuration for a Spark application

– Used to set various Spark parameters as key-value pairs
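A minimal sketch of this initialization in Python (application name and master URL are placeholders):

from pyspark import SparkConf, SparkContext

# SparkConf: Spark parameters as key-value pairs (values below are placeholders)
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")

# Only one SparkContext may be active per JVM
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100))
print(rdd.count())

# stop() the active SparkContext before creating a new one
sc.stop()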


WordCount in Java (complete)


RDD persistence

  • By default, each transformed RDD may be recomputed each time an action is run on it

  • Spark also supports the persistence (or caching) of RDDs in memory across operations for rapid reuse

– When an RDD is persisted, each node stores in memory any partitions of it that it computes and reuses them in other actions on that dataset (or datasets derived from it)
– This allows future actions to be much faster (even 100x)
– To persist an RDD, use the persist() or cache() methods on it
– Spark’s cache is fault-tolerant: a lost RDD partition is automatically recomputed using the transformations that originally created it

  • Key tool for iterative algorithms and fast interactive use


RDD persistence: storage level

  • Using persist() you can specify the storage level for persisting an RDD

– cache() is the same as calling persist() with the default storage level (MEMORY_ONLY)

  • Storage levels for persist():

– MEMORY_ONLY
– MEMORY_AND_DISK
– MEMORY_ONLY_SER, MEMORY_AND_DISK_SER (Java and Scala)
  • In Python, stored objects will always be serialized with the Pickle library
– DISK_ONLY, …
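A minimal sketch of persisting an RDD in PySpark (input path is illustrative):

from pyspark import StorageLevel

# Hypothetical input that is reused by several actions
points = sc.textFile("hdfs://.../points.txt").map(lambda l: l.split(","))

points.persist(StorageLevel.MEMORY_AND_DISK)   # points.cache() would use MEMORY_ONLY

n = points.count()        # first action: computes the partitions and stores them
sample = points.take(5)   # reuses the persisted partitions

points.unpersist()        # release the storage when the RDD is no longer needed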


RDD persistence: storage level

  • Which storage level is best? A few things to consider:

– Try to keep data in memory as much as possible
– Serialization makes the objects much more space-efficient
  • But select a fast serialization library (e.g., the Kryo library)
– Try not to spill to disk unless the functions that computed your datasets are expensive (e.g., they filter a large amount of the data)
– Use replicated storage levels only if you want fast fault recovery


Persistence: example of iterative algorithm

  • Let us consider logistic regression using gradient descent as an example of iterative algorithm
  • A brief introduction to logistic regression

– A widely applied supervised ML algorithm used to classify input data into categories (e.g., spam and non-spam emails)
– Probabilistic classification
– Binomial (or binary) logistic regression: output can take only two categories (yes/no)
– Sigmoid or logistic function: f(x) = 1 / (1 + e^(-x))


Persistence: example of iterative algorithm

  • Idea: search for a hyperplane w that best separates two sets of points

– Input values (x) are linearly combined using weights (w) to predict output values (y)
– x: input variables, our training data
– y: output (or target) variables that we are trying to predict given x
– w: weights (or parameters) that must be estimated using training data

  • Iterative algorithm that uses gradient descent to minimize the error

– Start w at a random value
– On each iteration, sum a function of w over the data to move w in a direction that improves it


For details see https://www.youtube.com/watch?v=SB2vz57eKgc


Persistence: example of iterative algorithm

// Load data into an RDD; we ask Spark to persist the RDD since it is reused across iterations
val points = sc.textFile(...).map(readPoint).persist()
// Start with a random parameter vector (the initial value of w)
var w = DenseVector.random(D)
// On each iteration, update the parameter vector with a sum over the data
for (i <- 1 to ITERATIONS) {
  // Gradient formula (based on the logistic function)
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w.dot(p.x)))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}

  • Persisting the RDD points in memory across iterations can yield a 100x speedup (see below)


Source: “Apache Spark: A Unified Engine for Big Data Processing”



Persistence: example of iterative algorithm

  • Spark outperforms Hadoop by up to 100x in iterative machine learning

– Speedup comes from avoiding I/O and deserialization costs by storing data in memory


Source: “Apache Spark: A Unified Engine for Big Data Processing”

[Figure: Hadoop takes about 110 s/iteration; Spark takes 80 s for the first iteration and about 1 s for further iterations]

How Spark works at runtime

  • A Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster


How Spark works at runtime

  • The application creates RDDs, transforms them, and runs actions

– This results in a DAG of operators

  • DAG is compiled into stages

– Stages are sequences of RDDs without a shuffle in between

  • Each stage is executed as a series of tasks (one task for each partition)

  • Actions drive the execution


Stage execution

  • Spark:

– Creates a task for each partition in the new RDD
– Schedules and assigns tasks to worker nodes

  • All this happens internally (you do not need to do anything)


Summary of Spark components

  • RDD: parallel dataset with partitions
  • DAG: logical graph of RDD operations
  • Stage: set of tasks that run in parallel
  • Task: fundamental unit of execution in Spark


(listed from coarse grain to fine grain)


Fault tolerance in Spark

  • RDDs track the series of transformations used to build them (their lineage)
  • Lineage information is used to recompute lost data
  • RDDs are stored as a chain of objects capturing the lineage of each RDD


Job scheduling in Spark

  • Spark job scheduling takes into account which partitions of persistent RDDs are available in memory
  • When a user runs an action on an RDD, the scheduler builds a DAG of stages from the RDD lineage graph
  • A stage contains as many pipelined transformations with narrow dependencies as possible
  • The boundaries of a stage:

– Shuffles for wide dependencies
– Already computed partitions


Job scheduling in Spark

  • The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD
  • Tasks are assigned to machines based on data locality

– If a task needs a partition that is available in the memory of a node, the task is sent to that node


Spark stack


Spark SQL


  • Spark library for structured data processing that allows running SQL queries on top of Spark
  • Compatible with existing Hive data

– You can run unmodified Hive queries

  • Speedup up to 40x
  • Motivation:

– Many users know SQL
– Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes
– Can we extend Hive to run on Spark?
– Started with the Shark project


Spark SQL: the beginning

  • The Shark project modified Hive’s backend to run over Spark, employing in-memory column-oriented storage
  • Limitations

– Limited integration with Spark programs
– Hive optimizer not designed for Spark


[Figure: Hive on Hadoop MapReduce vs. Shark on Spark]

Spark SQL

  • Borrows from Shark

– Hive data loading
– In-memory column store

  • Adds:

– RDD-aware optimizer (Catalyst Optimizer)
– Schema to RDD (DataFrame and Dataset APIs)
– Rich language interfaces


DataFrame and Dataset APIs

  • Evolution of Spark APIs: DataFrames and Datasets
  • Like RDDs, DataFrames and Datasets are immutable distributed collections of data
  • Like RDDs, DataFrames and Datasets are evaluated lazily by Spark
  • DataFrames (from Spark 1.3) introduce the concept of a schema to describe data

– Unlike RDDs, data is organized into named columns, like a table in a relational database
– Work only on structured and semi-structured data
– Spark SQL provides APIs to run SQL queries on DataFrames with a simple SQL-like syntax
– Since Spark 2.0, DataFrames are implemented as a special case of Datasets


DataFrame and Dataset APIs

  • Datasets (from Spark 1.6) extend DataFrames providing a type-safe, OO programming interface

– A structured but typed collection of data
– A DataFrame can be seen as a collection of generic type Dataset[Row], where Row is a generic and untyped JVM object
– A Dataset, by contrast, is a collection of strongly-typed JVM objects

  • SparkSession class: the entry point for both APIs


Datasets

  • Provide the benefits of RDDs (strong typing, ability to use lambda functions) with those of Spark SQL’s optimized execution engine (Catalyst optimizer)
  • Available in Scala and Java (not in Python and R)
  • Can be constructed from JVM objects
  • Can be manipulated using functional transformations (map, filter, flatMap, ...)
  • Lazy, i.e., computation is only triggered when an action is invoked

– Internally, a Dataset is a logical plan that describes the computation required to produce the data. When an action is invoked, Spark’s query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner


Datasets

  • How to create a Dataset?

– From a file using the read function
– From an existing RDD by converting it
– Through transformations applied on existing Datasets

  • Example in Scala:

val names = people.map(_.name)  // names is a Dataset[String]

  • Example in Java:

Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING());


DataFrames

  • DataFrame: a Dataset organized into named columns
  • Conceptually equivalent to a table in a relational database, but with richer optimizations

– Like Datasets, DataFrames exploit the Catalyst optimizer

  • Available in Scala, Java, Python, and R

– In Scala and Java, a DataFrame is represented by a Dataset of Rows

  • Can be constructed from:

– Existing RDDs, either inferring the schema using reflection or programmatically specifying the schema
– Tables in Hive
– Structured data files (JSON, Parquet, CSV, Avro)

  • Can be manipulated in similar ways to RDDs (see the sketch below)
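A minimal PySpark sketch of the points above (file name and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Build a DataFrame from a structured data file (hypothetical JSON with name and age fields)
people = spark.read.json("people.json")
people.printSchema()

# Manipulate it with the DataFrame API or with SQL over a temporary view
adults = people.filter(people.age >= 18).select("name", "age")
people.createOrReplaceTempView("people")
adults_sql = spark.sql("SELECT name, age FROM people WHERE age >= 18")

adults.show()
adults_sql.show()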


Parquet file format

  • Parquet is an efficient columnar data storage format
  • Supported not only by Spark but also by many other data processing frameworks

– Hive, Impala, Pig, ...

  • Interoperable with other data storage formats

– Avro, Thrift, Protocol Buffers, ...


Parquet file format


  • Supports efficient compression and encoding schemes

  • Example: Parquet vs. CSV

Spark SQL example: using Parquet


See https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
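The example itself is not reproduced above; a minimal PySpark sketch along the lines of the linked documentation (file names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

# Read a JSON file and save it in Parquet format
people = spark.read.json("people.json")
people.write.parquet("people.parquet")

# Parquet files are self-describing, so the schema is preserved when reading back
parquet_df = spark.read.parquet("people.parquet")
parquet_df.createOrReplaceTempView("people")
teenagers = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.show()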


Spark Streaming

  • Spark Streaming: extension that allows analyzing streaming data

– Ingested and analyzed in micro-batches

  • Uses a high-level abstraction called DStream (discretized stream), which represents a continuous stream of data

– A sequence of RDDs
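As a minimal illustration of the DStream abstraction, the classic streaming word count in Python (host and port are placeholders):

from pyspark.streaming import StreamingContext

# Micro-batches of 1 second, built on top of the existing SparkContext sc
ssc = StreamingContext(sc, 1)

# DStream of lines received from a TCP socket (hypothetical source)
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()             # start the computation
ssc.awaitTermination()  # wait for it to terminate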


Spark Streaming

  • Internally, the input data stream is divided into micro-batches, which are processed by the Spark engine to produce batches of results
  • We will study Spark Streaming later


Spark MLlib

  • Provides many distributed ML algorithms

– Classification (e.g., logistic regression), regression, clustering (e.g., K-means), recommendation, decision trees, random forests, and more

  • Also provides utilities

– For ML: feature transformations, model evaluation and hyper-parameter tuning
– For distributed linear algebra (e.g., PCA) and statistics (e.g., summary statistics, hypothesis testing)

  • Adopts DataFrames in order to support a variety of data types


Spark MLlib: Example

  • Dataset of labels and feature vectors
  • Learn to predict the labels from the feature vectors using logistic regression (a sketch follows)
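The code from the slide is not reproduced here; a minimal sketch of this task with the DataFrame-based MLlib API (file name and parameters are illustrative):

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

# Load a dataset of (label, features) rows; libsvm is one of the supported formats
training = spark.read.format("libsvm").load("sample_libsvm_data.txt")

# Fit a logistic regression model that predicts the label from the feature vector
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)

# Apply the model and inspect some predictions
model.transform(training).select("label", "prediction").show(5)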


Combining processing tasks with Spark


  • It is easy to seamlessly combine Spark libraries in the same application
  • Example combining SQL, machine learning, and streaming libraries in Spark

– Read some historical Twitter data using Spark SQL
– Train a K-means clustering model using MLlib
– Apply the model to a new stream of tweets

Combining processing tasks with Spark

// Load historical data as an RDD using Spark SQL
val trainingData = sql("SELECT location, language FROM old_tweets")
// Train a K-means model using MLlib
val model = new KMeans()
  .setFeaturesCol("location")
  .setPredictionCol("language")
  .fit(trainingData)
// Apply the model to new tweets in a stream
TwitterUtils.createStream(...)
  .map(tweet => model.predict(tweet.location))


References

  • Zaharia et al., “Spark: Cluster Computing with Working Sets”, HotCloud ’10. http://bit.ly/2rj0uqH
  • Zaharia et al., “Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing”, NSDI ’12. http://bit.ly/1XWkOFB
  • Zaharia et al., “Apache Spark: A Unified Engine for Big Data Processing”, Commun. ACM, 2016. https://bit.ly/2r9t8cI
  • Karau et al., “Learning Spark - Lightning-Fast Big Data Analysis”, O’Reilly, 2015.

  • Online resources and MOOCs: https://sparkhub.databricks.com
