
Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

Apache Spark

Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini

The reference Big Data stack


(Figure: stack layers – Support/Integration, High-level Interfaces, Data Processing, Data Storage, Resource Management)

Valeria Cardellini - SABD 2017/18


MapReduce: weaknesses and limitations

  • Programming model

– Hard to implement everything as a MapReduce program
– Multiple MapReduce steps can be needed even for simple operations
– Lack of control, structures and data types

  • No native support for iteration

– Each iteration writes/reads data from disk, leading to overheads
– Need to design algorithms that can minimize the number of iterations


MapReduce: weaknesses and limitations

  • Efficiency (recall HDFS)

– High communication cost: computation (map), communication (shuffle), computation (reduce)
– Frequent writing of output to disk
– Limited exploitation of main memory

  • Not feasible for real-time processing

– A MapReduce job requires scanning the entire input
– Stream processing and random access are impossible


Alternative programming models

  • Based on directed acyclic graphs (DAGs)

– E.g., Spark, Spark Streaming, Storm

  • SQL-based

– We have already analyzed Hive and Pig

  • NoSQL databases

– We have already analyzed HBase, MongoDB, Cassandra, …


Alternative programming models

  • Based on Bulk Synchronous Parallel (BSP)

– Developed by Leslie Valiant during the 1980s
– Considers communication actions en masse
– Suitable for graph analytics at massive scale and massive scientific computations (e.g., matrix, graph and network algorithms)

  • Examples: Google’s Pregel, Apache Hama, Apache Giraph
  • Apache Giraph: open source counterpart to Pregel, used at Facebook to analyze the users’ social graph


Apache Spark

  • Fast and general-purpose engine for Big Data processing

– Not a modified version of Hadoop
– It is becoming the leading platform for large-scale SQL, batch processing, stream processing, and machine learning
– It is evolving as a unified engine for distributed data processing

  • In-memory data storage for fast iterative processing

– At least 10x faster than Hadoop

  • Suitable for general execution graphs and powerful optimizations
  • Compatible with Hadoop’s storage APIs

– Can read/write to any Hadoop-supported system, including HDFS and HBase


Project history

  • Spark project started in 2009
  • Developed originally at UC Berkeley’s AMPLab by Matei Zaharia
  • Open sourced in 2010, an Apache project since 2013
  • In 2014, Zaharia founded Databricks
  • It is now the most active open source project for Big Data processing
  • Current version: 2.3.0


Spark: why a new programming model?

  • MapReduce simplified Big Data analysis

– But it executes jobs in a simple but rigid structure

  • Step to process or transform data (map)
  • Step to synchronize (shuffle)
  • Step to combine results (reduce)

  • As soon as MapReduce got popular, users wanted:

– Iterative computations (e.g., iterative graph algorithms and machine learning, such as stochastic gradient descent, K-means clustering)
– More efficiency
– More interactive ad-hoc queries
– Faster in-memory data sharing across parallel jobs (required by both iterative and interactive applications)


Data sharing in MapReduce

  • Slow due to replication, serialization and disk I/O


Data sharing in Spark

  • Distributed in-memory: 10x-100x faster than disk and network


Spark stack


Spark core

  • Provides basic functionalities (including task scheduling, memory management, fault recovery, interacting with storage systems) used by the other components
  • Provides a data abstraction called resilient distributed dataset (RDD), a collection of items distributed across many compute nodes that can be manipulated in parallel

– Spark Core provides many APIs for building and manipulating these collections

  • Written in Scala, but with APIs for Java, Python and R


Spark as a unified engine

  • A number of integrated higher-level libraries built on top of Spark

  • Spark SQL

– Spark’s package for working with structured data
– Allows querying data via SQL as well as the Apache Hive variant of SQL (HiveQL) and supports many sources of data including Hive tables, Parquet, and JSON
– Extends the Spark RDD API

  • Spark Streaming

– Enables processing live streams of data
– Extends the Spark RDD API


Spark as a unified engine

  • MLlib

– Library that provides many distributed machine learning (ML) algorithms, including feature extraction, classification, regression, clustering, recommendation

  • GraphX

– Library that provides an API for manipulating graphs and performing graph-parallel computations
– Also includes common graph algorithms (e.g., PageRank)
– Extends the Spark RDD API


Spark on top of cluster managers

  • Spark can exploit many cluster resource managers to execute its applications
  • Spark standalone mode

– Uses a simple FIFO scheduler included in Spark

  • Hadoop YARN
  • Mesos

– Mesos and Spark are both from AMPLab @ UC Berkeley

  • Kubernetes


Spark architecture

  • Master/worker architecture
  • Driver program that talks to the master
  • Master manages workers in which executors run

http://spark.apache.org/docs/latest/cluster-overview.html

Spark architecture

  • Applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program (the driver program)
  • Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads
  • To run on a cluster, the SparkContext can connect to a cluster manager (Spark’s cluster manager, Mesos or YARN), which allocates resources across applications
  • Once connected, Spark acquires executors on nodes in the cluster, which run computations and store data. Next, it sends the application code to the executors. Finally, SparkContext sends tasks to the executors to run


Spark data flow


Spark programming model


Resilient Distributed Dataset (RDD)

  • RDDs are the key programming abstraction in Spark: a distributed memory abstraction
  • Immutable, partitioned and fault-tolerant collection of elements that can be manipulated in parallel

– Like a LinkedList<MyObjects>
– Cached in memory across the cluster nodes

  • Each node of the cluster that is used to run an application contains at least one partition of the RDD(s) defined in the application


Resilient Distributed Dataset (RDD)

  • Stored in the main memory of the executors running in the worker nodes (when possible) or on the node’s local disk (if there is not enough main memory)

– Can persist in memory, on disk, or both

  • Allow executing in parallel the code invoked on them

– Each executor of a worker node runs the specified code on its partition of the RDD
– A partition is an atomic piece of information
– Partitions of an RDD can be stored on different cluster nodes


Resilient Distributed Dataset (RDD)

  • Immutable once constructed

– i.e., the RDD content cannot be modified

  • Automatically rebuilt on failure (but no replication)

– Track lineage information to efficiently recompute lost data
– For each RDD, Spark knows how it has been constructed and can rebuild it if a failure occurs
– This information is represented by means of a DAG connecting input data and RDDs

  • RDD API

– Clean language-integrated API for Scala, Python, Java, and R
– Can be used interactively from the Scala console


Resilient Distributed Dataset (RDD)

  • An RDD can be created by:

– Parallelizing existing collections of the hosting programming language (e.g., collections and lists of Scala, Java, Python, or R)

  • Number of partitions specified by the user
  • In the RDD API: parallelize

– From (large) files stored in HDFS or any other file system

  • One partition per HDFS block
  • In the RDD API: textFile

– By transforming an existing RDD

  • Number of partitions depends on the transformation type
  • In the RDD API: transformation operations (map, filter, flatMap)


Resilient Distributed Dataset (RDD)

  • Applications suitable for RDDs

– Batch applications that apply the same operation to all the elements in a dataset

  • Applications not suitable for RDDs

– Applications that make asynchronous fine-grained updates to shared state, e.g., a storage system for a web application


Resilient Distributed Dataset (RDD)

  • Spark programs are written in terms of operations on RDDs
  • RDDs are built and manipulated through:

– Coarse-grained transformations

  • map, filter, join, …

– Actions

  • count, collect, save, …


Spark and RDDs

  • Spark manages the scheduling and synchronization of the jobs
  • Manages the split of RDDs into partitions and allocates the RDDs’ partitions to the nodes of the cluster
  • Hides the complexities of fault tolerance and slow machines
  • RDDs are automatically rebuilt in case of machine failure



Spark programming model

  • The Spark programming model is based on parallelizable operators
  • Parallelizable operators are higher-order functions that execute user-defined functions in parallel
  • A data flow is composed of any number of data sources, operators, and data sinks by connecting their inputs and outputs
  • Job description based on a DAG


Higher-order functions

  • Higher-order functions: RDD operators
  • Two types of RDD operators: transformations and actions
  • Transformations: lazy operations that create new RDDs

– Lazy: the new RDD representing the result of a computation is not immediately computed but is materialized on demand when an action is called

  • Actions: operations that return a value to the driver program after running a computation on the dataset, or write data to external storage


Higher-order functions

  • Transformations and actions available on RDDs in Spark
  • Seq[T] denotes a sequence of elements of type T

RDD transformations: map

  • map: takes as input a function which is applied to each element of the RDD and maps each input item to another item
  • filter: generates a new RDD by filtering the source dataset using the specified function
  • flatMap: takes as input a function which is applied to each element of the RDD; can map each input item to zero or more output items

Examples in Scala:

– Range class in Scala: ordered sequence of integer values in range [start;end) with non-zero step
– sc is the SparkContext
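The Scala examples on this slide did not survive extraction; as a rough stand-in, the semantics of the three transformations can be illustrated with plain Python list comprehensions (illustration only, not Spark code; the sample data is invented):

```python
# Pure-Python stand-ins for Spark's map, filter and flatMap transformations.
data = list(range(1, 6))                      # plays the role of an RDD of [1, 2, 3, 4, 5]

mapped = [x * 2 for x in data]                # map: exactly one output item per input item
filtered = [x for x in data if x % 2 == 0]    # filter: keep items matching the predicate
flattened = [w for s in ["a b", "c"] for w in s.split(" ")]  # flatMap: 0..n outputs per item

print(mapped)     # [2, 4, 6, 8, 10]
print(filtered)   # [2, 4]
print(flattened)  # ['a', 'b', 'c']
```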


RDD transformations: reduceByKey

  • reduceByKey: aggregates values with identical keys using the specified function
  • Groups are independently processed
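The per-key aggregation can be sketched in plain Python (a sequential stand-in, not Spark code; reduce_by_key and the sample pairs are invented for illustration):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Sequential stand-in for Spark's reduceByKey: values sharing a key are
    aggregated with func; each group is processed independently, which is
    what makes the real operation parallelizable."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

pairs = [("spark", 1), ("hadoop", 1), ("spark", 1), ("spark", 1)]
print(reduce_by_key(pairs, lambda a, b: a + b))  # {'spark': 3, 'hadoop': 1}
```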


RDD transformations: join

  • join: performs an equi-join on the keys of two RDDs
  • Join candidates are independently processed
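The equi-join semantics can likewise be sketched in plain Python (illustration only; equi_join and the sample collections are invented, not a Spark API):

```python
from collections import defaultdict

def equi_join(left, right):
    """Stand-in for Spark's join on key/value pairs: for each key present in
    both collections, emit every (left value, right value) combination."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # every matching pair of values is emitted per key, independently
    return [(k, (lv, rv)) for k, lv in left for rv in index[k]]

users = [(1, "ann"), (2, "bob")]
purchases = [(1, "book"), (1, "pen"), (3, "ink")]
print(equi_join(users, purchases))
# [(1, ('ann', 'book')), (1, ('ann', 'pen'))]
```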


Basic RDD actions

  • collect: returns all the elements of the RDD as an array
  • take: returns an array with the first n elements of the RDD
  • count: returns the number of elements in the RDD


Basic RDD actions

  • reduce: aggregates the elements of the RDD using the specified function
  • saveAsTextFile: writes the elements of the RDD as a text file, either to the local file system or to HDFS
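These actions have direct analogues on an ordinary Python list, which can serve as a rough illustration of their semantics (the sample data is invented; this is not Spark code):

```python
from functools import reduce

data = [3, 1, 4, 1, 5]                    # plays the role of an RDD

collected = list(data)                     # collect: all elements as an array
first_two = data[:2]                       # take(2): the first n elements
count = len(data)                          # count: number of elements
total = reduce(lambda a, b: a + b, data)   # reduce: aggregate with an associative function

print(total)  # 14
```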


How to create RDDs

  • Turn a collection into an RDD

– Important parameter: number of partitions to cut the dataset into
– Spark will run one task for each partition of the cluster (typical setting: 2-4 partitions for each CPU in the cluster)
– Spark tries to set the number of partitions automatically
– You can also set it manually by passing it as a second parameter to parallelize, e.g., sc.parallelize(data, 10)

  • Load a text file from the local file system, HDFS, or S3
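The cutting of a collection into partitions can be sketched as follows (a toy stand-in for what sc.parallelize(data, numPartitions) does, not Spark's actual partitioner; split_into_partitions is an invented helper):

```python
def split_into_partitions(data, num_partitions):
    """Cut a collection into num_partitions roughly equal slices;
    Spark would then run one task per partition."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # the first `remainder` partitions get one extra element
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

print(split_into_partitions(list(range(10)), 3))
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```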


Lazy transformations

  • Transformations are lazy: they are not computed until an action requires a result to be returned to the driver program
  • This design enables Spark to perform operations more efficiently, as operations can be grouped together

– E.g., if there were multiple filter or map operations, Spark can fuse them into one pass
– E.g., if Spark knows that data is partitioned, it can avoid moving it over the network for groupBy
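Lazy evaluation and single-pass fusion can be mimicked with Python generators (illustration only; lazy_map and lazy_filter are invented helpers, not Spark APIs):

```python
def lazy_map(func, items, log):
    # Generator: the body runs only when a consumer pulls elements
    for x in items:
        log.append("map")
        yield func(x)

def lazy_filter(pred, items, log):
    for x in items:
        log.append("filter")
        if pred(x):
            yield x

log = []
pipeline = lazy_filter(lambda x: x % 2 == 0,
                       lazy_map(lambda x: x + 1, [1, 2, 3], log), log)
print(log)               # []  -- transformations declared, nothing computed yet
result = list(pipeline)  # the "action": forces a single fused pass
print(result)            # [2, 4]
print(log)               # ['map', 'filter', 'map', 'filter', 'map', 'filter']
```

Note how the log interleaves map and filter steps: each element flows through both operations in one pass, instead of the whole list being mapped first and filtered afterwards.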


First examples

  • Let us analyze some simple examples that use the Spark API

– WordCount
– Pi estimation

  • See http://spark.apache.org/examples.html


WordCount in Scala

val textFile = sc.textFile("hdfs://...")
val words = textFile.flatMap(line => line.split(" "))
val ones = words.map(word => (word, 1))
val counts = ones.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")


WordCount in Scala: with chaining

  • Transformations and actions can be chained together

– We use a few transformations to build a dataset of (String, Int) pairs called counts and then save it to a file

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")


WordCount in Python

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
output = counts.collect()            # action: bring the results to the driver
counts.saveAsTextFile("hdfs://...")  # or write them out as a text file


WordCount in Java 7 (and earlier)

JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");

Spark’s Java API allows creating tuples using the scala.Tuple2 class; pair RDDs are RDDs containing key/value pairs.


WordCount in Java 8

  • The example in Java 7 is too verbose
  • Support for lambda expressions from Java 8

– Anonymous methods (methods without names) used to implement a method defined by a functional interface
– The new arrow operator -> divides a lambda expression in two parts

  • Left side: parameters required by the lambda expression
  • Right side: actions of the lambda expression

JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("hdfs://...");


SparkContext

  • Main entry point for Spark functionality

– Represents the connection to a Spark cluster, and can be used to create RDDs on that cluster
– Also available in the shell (as variable sc)

  • Only one SparkContext may be active per JVM

– You must stop() the active SparkContext before creating a new one

  • SparkConf class: configuration for a Spark application

– Used to set various Spark parameters as key-value pairs


WordCount in Java (complete - 1)


WordCount in Java (complete - 2)


Persistence

  • By default, each transformed RDD may be recomputed each time an action is run on it
  • Spark also supports the persistence of RDDs in memory across operations for rapid reuse

– When an RDD is persisted, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it)
– This allows future actions to be much faster (even 100x)
– To persist an RDD, use the persist() or cache() methods on it
– The storage level can be specified (memory by default)
– Spark’s cache is fault-tolerant: a lost RDD partition is automatically recomputed using the transformations that originally created it

  • Useful when data is accessed repeatedly, e.g., when querying a small “hot” dataset or when running an iterative algorithm (e.g., logistic regression, PageRank)


Persistence: storage level

  • Using persist() you can specify the storage level for persisting an RDD

– cache() is the same as calling persist() with the default storage level (MEMORY_ONLY)

  • Storage levels for persist():

– MEMORY_ONLY
– MEMORY_AND_DISK
– MEMORY_ONLY_SER, MEMORY_AND_DISK_SER
– DISK_ONLY, …

  • Which storage level is best? A few things to consider:

– Try to keep as much as possible in memory
– Serialization makes the objects much more space-efficient
– Try not to spill to disk unless the functions that computed your datasets are expensive
– Use replication only if you want fault tolerance


Persistence: example of iterative algorithm

  • Let us consider logistic regression using gradient descent as an example of iterative algorithm
  • A brief introduction to logistic regression

– A widely applied supervised ML algorithm used to classify input data into categories (e.g., spam and non-spam emails)
– Probabilistic classification
– Binomial (or binary) logistic regression: the output can take only two categories (yes/no)
– Sigmoid or logistic function: g(z) = 1 / (1 + e^(-z))
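The sigmoid function itself is a one-liner; a minimal sketch:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)).
    Maps any real number into (0, 1), so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))  # 0.5 -- the decision boundary between the two categories
```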


Persistence: example of iterative algorithm

  • Idea: search for a hyperplane w that best separates two sets of points

– Input values (x) are linearly combined using weights (w) to predict output values (y)
– x: input variables, our training data
– y: output or target variables that we are trying to predict given x
– w: weights or parameters that must be estimated using the training data

  • Iterative algorithm that uses gradient descent to minimize the error

– Start w at a random value
– On each iteration, sum a function of w over the data to move w in a direction that improves it

For details see https://www.youtube.com/watch?v=SB2vz57eKgc


Persistence: example of iterative algorithm

// Load data into an RDD; we ask Spark to persist it for reuse across iterations
val points = sc.textFile(...).map(readPoint).persist()
// Start with a random parameter vector
var w = DenseVector.random(D)
// On each iteration, update the parameter vector with a sum
// (the gradient formula is based on the logistic function)
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w.dot(p.x)))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}

  • Persisting the RDD points in memory across iterations can yield 100x speedup (see next slide)

Source: “Apache Spark: A Unified Engine for Big Data Processing”
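For illustration, here is a pure-Python, single-feature analogue of the Scala snippet above (the toy dataset and the fixed starting weight are invented for this sketch; real logistic regression would use feature vectors and a random start):

```python
import math

# Invented toy dataset: (x, y) pairs with labels y in {-1, +1};
# positive x should be classified as +1, negative x as -1.
points = [(2.0, 1), (1.5, 1), (-2.0, -1), (-1.0, -1)]

w = 0.0  # a fixed starting weight instead of a random vector
for _ in range(100):
    # One-dimensional version of the gradient from the Scala snippet:
    # sum over all points of x * (1/(1 + exp(-y*(w*x))) - 1) * y
    gradient = sum(
        x * (1.0 / (1.0 + math.exp(-y * (w * x))) - 1.0) * y
        for x, y in points
    )
    w -= gradient

print(w > 0)  # True: the learned weight separates the two classes
```

In Spark the only distributed step is the map/reduce computing the gradient; persisting `points` keeps the dataset in memory across the 100 passes instead of re-reading it each time.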


Persistence: example of iterative algorithm

  • Spark outperforms Hadoop by up to 100x in iterative machine learning

– Speedup comes from avoiding I/O and deserialization costs by storing data in memory
– Hadoop: 110 s per iteration; Spark: 80 s for the 1st iteration, 1 s for further iterations

Source: “Apache Spark: A Unified Engine for Big Data Processing”

How Spark works at runtime

  • A Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster


How Spark works at runtime

  • The application creates RDDs, transforms them, and runs actions

– This results in a DAG of operators

  • The DAG is compiled into stages

– Stages are sequences of RDDs without a shuffle in between

  • Each stage is executed as a series of tasks (one task for each partition)
  • Actions drive the execution
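The compilation of an operator chain into stages can be sketched with a toy splitter (a simplification, not Spark's scheduler; the operator names and the set of "shuffle" operators are assumptions for illustration):

```python
# Operators that require a shuffle and therefore start a new stage
# (simplified assumption; Spark actually analyzes RDD dependencies).
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "join"}

def split_into_stages(ops):
    """Split a linear chain of operators into stages at shuffle boundaries."""
    stages, current = [], []
    for op in ops:
        if op in SHUFFLE_OPS and current:
            stages.append(current)  # a shuffle closes the current stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

wordcount = ["textFile", "flatMap", "map", "reduceByKey", "saveAsTextFile"]
print(split_into_stages(wordcount))
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'saveAsTextFile']]
```

For WordCount this yields two stages: everything up to the shuffle runs pipelined over each partition, and the aggregation plus output forms a second stage.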

Stage execution

  • Spark:

– Creates a task for each partition in the new RDD
– Schedules and assigns tasks to worker nodes

  • All this happens internally (you don’t need to do anything)


Summary of Spark components

  • RDD: parallel dataset with partitions (coarse grain)
  • DAG: logical graph of RDD operations
  • Stage: set of tasks that run in parallel
  • Task: fundamental unit of execution in Spark (fine grain)

Fault tolerance in Spark

  • RDDs track the series of transformations used to build them (their lineage)
  • Lineage information is used to recompute lost data
  • RDDs are stored as a chain of objects capturing the lineage of each RDD
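Lineage-based recovery can be illustrated with a toy class that remembers its parent and the transformation that built it (nothing like Spark's real implementation; TinyRDD is invented for this sketch):

```python
class TinyRDD:
    """Toy illustration of lineage-based recovery: a lost result is
    recomputed from its parent instead of being replicated."""

    def __init__(self, data=None, parent=None, transform=None):
        self.cached = data        # materialized data; None means "lost"
        self.parent = parent      # the RDD this one was derived from
        self.transform = transform  # how to rebuild it from the parent

    def map(self, func):
        # Record the lineage link; nothing is computed here
        return TinyRDD(parent=self,
                       transform=lambda items: [func(x) for x in items])

    def compute(self):
        if self.cached is None:
            # walk up the lineage chain and re-apply the transformation
            self.cached = self.transform(self.parent.compute())
        return self.cached

base = TinyRDD(data=[1, 2, 3])
derived = base.map(lambda x: x * 10)
print(derived.compute())  # [10, 20, 30]
derived.cached = None     # simulate losing the computed partition
print(derived.compute())  # recomputed from lineage: [10, 20, 30]
```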


Job scheduling in Spark

  • Spark job scheduling takes into account which partitions of persistent RDDs are available in memory
  • When a user runs an action on an RDD, the scheduler builds a DAG of stages from the RDD lineage graph
  • A stage contains as many pipelined transformations with narrow dependencies as possible
  • The boundary of a stage:

– Shuffles for wide dependencies
– Already computed partitions


Job scheduling in Spark

  • The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD
  • Tasks are assigned to machines based on data locality

– If a task needs a partition which is available in the memory of a node, the task is sent to that node


Spark stack


Spark SQL

  • Spark library for structured data processing that allows running SQL queries on top of Spark
  • Compatible with existing Hive data

– You can run unmodified Hive queries

  • Speedup up to 40x
  • Motivation:

– Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes
– Many users know SQL
– Can we extend Hive to run on Spark?
– Started with the Shark project


Spark SQL: the beginning

  • The Shark project modified the Hive backend to run over Spark
  • Shark employed in-memory column-oriented storage
  • Limitations:

– Limited integration with Spark programs
– Hive optimizer not designed for Spark

(Evolution: Hive on Hadoop MapReduce → Shark on Spark)

Spark SQL

  • Borrows from Shark:

– Hive data loading
– In-memory column store

  • Adds:

– RDD-aware optimizer (Catalyst Optimizer)
– Schema to RDD (DataFrame API)
– Rich language interfaces


Spark SQL: efficient in-memory storage

  • Simply caching records as Java objects is inefficient due to high per-object overhead
  • Spark SQL employs column-oriented storage using arrays of primitive types
  • This format is called Parquet


Spark SQL: DataFrame API

  • Extension of RDD that provides a distributed collection of rows with a homogeneous and known schema

– Equivalent to a table in a relational database (or a data frame in Python)

  • Can be constructed from: structured data files, tables in Hive, external databases, or existing RDDs
  • Can be manipulated in similar ways to RDDs
  • DataFrames are lazy
  • The DataFrame API is available in Scala, Java, Python, and R


Spark on Amazon EMR

  • Example in Scala using Hive on Spark: query about domestic flights in the US http://amzn.to/2r9vgme

val df = session.read.parquet(
  "s3://us-east-1.elasticmapreduce.samples/flightdata/input/")
// Parquet files can be registered as tables and used in SQL statements
df.createOrReplaceTempView("flights")
// Top 10 airports with the most departures since 2000
val topDepartures = session.sql("""SELECT origin, count(*) AS total_departures
  FROM flights WHERE year >= '2000'
  GROUP BY origin
  ORDER BY total_departures DESC LIMIT 10""")


Spark Streaming

  • Spark Streaming: extension that allows analyzing streaming data

– Ingested and analyzed in micro-batches

  • Uses a high-level abstraction called DStream (discretized stream), which represents a continuous stream of data

– A sequence of RDDs


Spark Streaming

  • Internally, it divides the live input stream into small batches, which are then processed by the Spark engine
  • We will study Spark Streaming later


Combining processing tasks with Spark

  • Since Spark’s libraries all operate on RDDs as the data abstraction, it is easy to combine them in applications
  • Example combining the SQL, machine learning, and streaming libraries in Spark:

– Read some historical Twitter data using Spark SQL
– Train a K-means clustering model using MLlib
– Apply the model to a new stream of tweets


Combining processing tasks with Spark

// Load historical data as an RDD using Spark SQL
val trainingData = sql("SELECT location, language FROM old_tweets")
// Train a K-means model using MLlib
val model = new KMeans()
  .setFeaturesCol("location")
  .setPredictionCol("language")
  .fit(trainingData)
// Apply the model to new tweets in a stream
TwitterUtils.createStream(...)
  .map(tweet => model.predict(tweet.location))

References

  • Zaharia et al., “Spark: Cluster Computing with Working Sets”, HotCloud’10. http://bit.ly/2rj0uqH
  • Zaharia et al., “Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing”, NSDI’12. http://bit.ly/1XWkOFB
  • Zaharia et al., “Apache Spark: A Unified Engine for Big Data Processing”, Commun. ACM, 2016. https://bit.ly/2r9t8cI
  • Karau et al., “Learning Spark - Lightning-Fast Big Data Analysis”, O’Reilly, 2015.
  • Online resources and MOOCs: https://sparkhub.databricks.com