SLIDE 1

An Introduction to Apache Spark

Amir H. Payberah

amir@sics.se

SICS Swedish ICT

Feb. 2, 2016

SLIDE 2

Big Data

(figure: small data vs. big data)

SLIDE 3

Big Data

SLIDE 4

How To Store and Process Big Data?

SLIDE 5

Scale Up vs. Scale Out

◮ Scale up or scale vertically
◮ Scale out or scale horizontally

SLIDE 7

Three Main Layers: Big Data Stack

SLIDE 8

Resource Management Layer

SLIDE 9

Storage Layer

SLIDE 10

Processing Layer

SLIDE 11

Spark Processing Engine

SLIDE 12

Cluster Programming Model

SLIDE 13

Warm-up Task (1/2)

◮ We have a huge text document.
◮ Count the number of times each distinct word appears in the file.
◮ Application: analyze web server logs to find popular URLs.

SLIDE 14

Warm-up Task (2/2)

◮ File is too large for memory, but all (word, count) pairs fit in memory.

◮ words(doc.txt) | sort | uniq -c

SLIDES 15-20

Warm-up Task in MapReduce

◮ words(doc.txt) | sort | uniq -c
◮ Sequentially read a lot of data.
◮ Map: extract something you care about.
◮ Group by key: sort and shuffle.
◮ Reduce: aggregate, summarize, filter or transform.
◮ Write the result.
SLIDE 21

Example: Word Count

◮ Consider doing a word count of the following file using MapReduce:

Hello World Bye World
Hello Hadoop Goodbye Hadoop

SLIDE 22

Example: Word Count - map

◮ The map function reads in words one at a time and outputs (word, 1) for each parsed input word.

◮ The map function output is:

(Hello, 1) (World, 1) (Bye, 1) (World, 1)
(Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)

SLIDE 23

Example: Word Count - shuffle

◮ The shuffle phase between the map and reduce phases creates a list of values associated with each key.

◮ The reduce function input is:

(Bye, (1))
(Goodbye, (1))
(Hadoop, (1, 1))
(Hello, (1, 1))
(World, (1, 1))

SLIDE 24

Example: Word Count - reduce

◮ The reduce function sums the numbers in the list for each key and outputs (word, count) pairs.

◮ The output of the reduce function is the output of the MapReduce job:

(Bye, 1)
(Goodbye, 1)
(Hadoop, 2)
(Hello, 2)
(World, 2)

SLIDE 25

Example: Word Count - map

public static class MyMap extends Mapper<...> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

SLIDE 26

Example: Word Count - reduce

public static class MyReduce extends Reducer<...> {
  public void reduce(Text key, Iterator<...> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    while (values.hasNext())
      sum += values.next().get();
    context.write(key, new IntWritable(sum));
  }
}

SLIDE 27

Example: Word Count - driver

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  job.setMapperClass(MyMap.class);
  job.setCombinerClass(MyReduce.class);
  job.setReducerClass(MyReduce.class);

  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  job.waitForCompletion(true);
}

SLIDES 28-29

Data Flow Programming Model

◮ Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

◮ Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures.

◮ MapReduce greatly simplified big data analysis on large unreliable clusters.

SLIDE 30

MapReduce Limitation

◮ MapReduce programming model has not been designed for complex operations, e.g., data mining.

◮ Very expensive (slow), i.e., always goes to disk and HDFS.

SLIDE 31

Spark (1/3)

◮ Extends MapReduce with more operators.
◮ Support for advanced data flow graphs.
◮ In-memory and out-of-core processing.
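To make the in-memory and out-of-core point concrete, here is a minimal sketch (not from the slides; it assumes a running SparkContext sc, the Spark 1.x RDD API, and a hypothetical input path):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://namenode:9000/path/logs")  // hypothetical path

// Keep the filtered RDD in memory across jobs.
val errors = logs.filter(_.contains("ERROR")).cache()

// Out-of-core: partitions that do not fit in memory spill to disk.
val persisted = logs.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()  // first action computes and caches the data
errors.count()  // later actions reuse the in-memory copy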

SLIDES 32-33

Spark (2/3)
SLIDES 34-35

Spark (3/3)
SLIDE 36

Resilient Distributed Datasets (RDD) (1/2)

◮ A distributed memory abstraction.

◮ Immutable collections of objects spread across a cluster.
  • Like a LinkedList<MyObjects>
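A small sketch of what immutability means in practice (assuming an existing SparkContext sc): transformations never modify an RDD in place, they return a new one.

val nums = sc.parallelize(Array(1, 2, 3))

// map does not change nums; it produces a new RDD.
val doubled = nums.map(_ * 2)

nums.collect()     // Array(1, 2, 3) -- unchanged
doubled.collect()  // Array(2, 4, 6)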

SLIDES 37-38

Resilient Distributed Datasets (RDD) (2/2)

◮ An RDD is divided into a number of partitions, which are atomic pieces of information.

◮ Partitions of an RDD can be stored on different nodes of a cluster.

◮ Built through coarse-grained transformations, e.g., map, filter, join.
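A short sketch of partitioning (again assuming a SparkContext sc; the partition count 4 is illustrative): the number of partitions can be chosen when an RDD is created and inspected afterwards.

// Distribute a collection over 4 partitions.
val nums = sc.parallelize(1 to 100, 4)

nums.partitions.length  // 4

// Each partition is processed as a unit, e.g., count the elements per partition.
nums.mapPartitions(iter => Iterator(iter.size)).collect()  // Array(25, 25, 25, 25)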

SLIDE 39

Spark Programming Model

◮ Job description based on directed acyclic graphs (DAG).

SLIDES 40-41

Creating RDDs

◮ Turn a collection into an RDD.

val a = sc.parallelize(Array(1, 2, 3))

◮ Load text file from local FS, HDFS, or S3.

val a = sc.textFile("file.txt")
val b = sc.textFile("directory/*.txt")
val c = sc.textFile("hdfs://namenode:9000/path/file")

SLIDE 42

RDD Higher-Order Functions

◮ Higher-order functions: RDD operators.
◮ There are two types of RDD operators: transformations and actions.
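A minimal sketch of the difference (assuming a SparkContext sc): transformations are lazy and only describe the computation, while an action triggers its execution.

val nums = sc.parallelize(Array(1, 2, 3))

// Transformations: lazy, nothing is executed yet.
val squares = nums.map(x => x * x)
val even = squares.filter(_ % 2 == 0)

// Action: runs the whole chain and returns a result.
even.collect()  // Array(4)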

SLIDES 43-44

RDD Transformations - Map

◮ All pairs are independently processed.

// Pass each element through a function.
val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x)  // {1, 4, 9}

// Select the elements for which the function returns true.
val even = squares.filter(x => x % 2 == 0)  // {4}

// Map each element to zero or more others.
nums.flatMap(x => Range(0, x, 1))  // {0, 0, 1, 0, 1, 2}

SLIDES 45-46

RDD Transformations - Reduce

◮ Pairs with identical key are grouped.
◮ Groups are independently processed.

val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))

pets.reduceByKey((x, y) => x + y)  // {(cat, 3), (dog, 1)}
pets.groupByKey()                  // {(cat, (1, 2)), (dog, (1))}

SLIDES 47-48

RDD Transformations - Join

◮ Performs an equi-join on the key.
◮ Join candidates are independently processed.

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                ("about.html", "3.4.5.6"),
                                ("index.html", "1.3.3.1")))

val pageNames = sc.parallelize(Seq(("index.html", "Home"),
                                   ("about.html", "About")))

visits.join(pageNames)
// ("index.html", ("1.2.3.4", "Home"))
// ("index.html", ("1.3.3.1", "Home"))
// ("about.html", ("3.4.5.6", "About"))

SLIDES 49-51

Basic RDD Actions (1/2)

◮ Return all the elements of the RDD as an array.

val nums = sc.parallelize(Array(1, 2, 3))
nums.collect()  // Array(1, 2, 3)

◮ Return an array with the first n elements of the RDD.

nums.take(2)  // Array(1, 2)

◮ Return the number of elements in the RDD.

nums.count()  // 3

SLIDES 52-53

Basic RDD Actions (2/2)

◮ Aggregate the elements of the RDD using the given function.

nums.reduce((x, y) => x + y)
// or
nums.reduce(_ + _)  // 6

◮ Write the elements of the RDD as a text file.

nums.saveAsTextFile("hdfs://file.txt")

SLIDE 54

SparkContext

◮ Main entry point to Spark functionality.
◮ Available in shell as variable sc.
◮ In standalone programs, you should make your own.

val sc = new SparkContext(master, appName, [sparkHome], [jars])
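For example, a minimal concrete instantiation (a sketch against the Spark 1.x API; the master "local[2]" and the app name are just illustrative values):

import org.apache.spark.SparkContext

// Run locally with 2 worker threads; "MyApp" is an arbitrary application name.
val sc = new SparkContext("local[2]", "MyApp")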

SLIDE 55

Example: Word Count

val textFile = sc.textFile("hdfs://...")

val words = textFile.flatMap(line => line.split(" "))
val ones = words.map(word => (word, 1))
val counts = ones.reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")

SLIDE 56

Example: Word Count

val textFile = sc.textFile("hdfs://...")

val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")

SLIDE 57

Lineage

◮ Lineage: transformations used to build an RDD.

◮ RDDs are stored as a chain of objects capturing the lineage of each RDD.

val file = sc.textFile("hdfs://...")
val sics = file.filter(_.contains("SICS"))
val cachedSics = sics.cache()
val ones = cachedSics.map(_ => 1)
val count = ones.reduce(_+_)
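As a side note (standard RDD API), the lineage captured for an RDD can be printed with toDebugString:

// Prints the chain of RDDs behind `ones`, e.g., MapPartitionsRDD <- ... <- HadoopRDD.
println(ones.toDebugString)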

SLIDE 58

Spark Execution Plan

1. Connects to a cluster manager, which allocates resources across applications.
2. Acquires executors on cluster nodes (worker processes) to run computations and store data.
3. Sends app code to the executors.
4. Sends tasks for the executors to run.
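A sketch of how step 1 looks from the driver side (the master URL and the memory setting are illustrative values, not from the slides): the master URL in SparkConf tells the driver which cluster manager to connect to.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone cluster manager at spark://master:7077,
// requesting 2 GB of memory per executor.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("MyApp")
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)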

SLIDES 59-60

Spark SQL
SLIDE 61

DataFrame

◮ A DataFrame is a distributed collection of rows with a homogeneous schema.

◮ It is equivalent to a table in a relational database.

◮ It can also be manipulated in similar ways to RDDs.

◮ DataFrames are lazy.
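To make the last point concrete, a small sketch (assuming an existing DataFrame df with the age/name schema used on the next slides): a filter only builds a query plan; nothing runs until an action such as show() is called.

// Lazy: only constructs a logical query plan.
val adults = df.filter(df("age") > 21)

// Action: the plan is optimized and executed now.
adults.show()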

SLIDE 62

Adding Schema to RDDs

◮ Spark + RDD: functional transformations on partitioned collections of opaque objects.

◮ SQL + DataFrame: declarative transformations on partitioned collections of tuples.
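A sketch of the contrast (peopleRdd and peopleDf are assumed inputs, introduced only for illustration): with RDDs the engine sees an opaque closure, while with DataFrames it sees a declarative expression over a named column that the optimizer can reason about.

// peopleRdd: RDD[Person], peopleDf: DataFrame -- assumed to exist.

// RDD: the lambda is opaque to Spark.
val adultsRdd = peopleRdd.filter(p => p.age > 21)

// DataFrame: the predicate is a declarative expression over the "age" column.
val adultsDf = peopleDf.filter(peopleDf("age") > 21)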

SLIDE 63

Creating DataFrames

◮ The entry point into all functionality in Spark SQL is the SQLContext.

val sc: SparkContext  // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.json(...)

SLIDE 64

DataFrame Operations (1/2)

◮ Domain-specific language for structured data manipulation.

// Show the content of the DataFrame.
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin

// Print the schema in a tree format.
df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Select only the "name" column.
df.select("name").show()
// name
// Michael
// Andy
// Justin

SLIDE 65

DataFrame Operations (2/2)

◮ Domain-specific language for structured data manipulation.

// Select everybody, but increment the age by 1.
df.select(df("name"), df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21.
df.filter(df("age") > 21).show()
// age name
// 30  Andy

// Count people by age.
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1

SLIDE 66

Running SQL Queries Programmatically

◮ SQL queries can be run programmatically, returning the result as a DataFrame.

◮ Use the sql function on a SQLContext.

val sqlContext = ...  // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")

SLIDES 67-68

Converting RDDs into DataFrames

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile(...).map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext
  .sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect()
  .foreach(println)

SLIDE 69

Spark Streaming

SLIDE 70

Data Streaming

◮ Many applications must process large streams of live data and provide results in real time.

  • Wireless sensor networks
  • Traffic management applications
  • Stock market applications
  • Environmental monitoring applications
  • Fraud detection tools
  • ...

SLIDE 71

Stream Processing Systems

◮ Stream Processing Systems (SPS): data-in-motion analytics.
  • Process information as it flows, without storing it persistently.

◮ Database Management Systems (DBMS): data-at-rest analytics.
  • Store and index data before processing it.
  • Process data only when explicitly asked by the users.

SLIDE 72

DBMS vs. SPS (1/2)

◮ DBMS: persistent data where updates are relatively infrequent.
◮ SPS: transient data that is continuously updated.

SLIDE 73

DBMS vs. SPS (2/2)

◮ DBMS: runs queries just once to return a complete answer.
◮ SPS: executes standing queries, which run continuously and provide updated answers as new data arrives.

SLIDE 74

SPS Architecture

◮ Data source: producer of streaming data.
◮ Data sink: consumer of results.
◮ A data stream is unbounded and broken into a sequence of individual data items, called tuples.

SLIDE 75

Spark Streaming

◮ Run a streaming computation as a series of very small, deterministic batch jobs (sketched below).
  • Chop up the live stream into batches of X seconds.
  • Treat each batch of data as an RDD and process it using RDD operations.
  • Finally, the processed results of the RDD operations are returned in batches.
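A minimal sketch of this micro-batch model (host, port, and the 1-second batch size are illustrative; Spark 1.x streaming API assumed):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every 1-second batch of input becomes one small deterministic job.
val ssc = new StreamingContext("local[2]", "MicroBatchDemo", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

lines.count().print()  // per-batch record count

ssc.start()
ssc.awaitTermination()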

SLIDES 76-77

Discretized Stream Processing (DStream)

◮ DStream: sequence of RDDs representing a stream of data.
  • TCP sockets, Twitter, HDFS, Kafka, ...

◮ Initializing Spark Streaming:

val ssc = new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])

SLIDES 78-79

DStream API

◮ Transformations: modify data from one DStream to a new DStream.
  • Standard RDD operations: map, join, ...
  • Window operations: group all the records from a sliding window of the past time intervals into one RDD, e.g., window, reduceByKeyAndWindow, ... (see the sketch below)
    Window length: the duration of the window.
    Slide interval: the interval at which the operation is performed.
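A small sketch of a window operation (assuming a DStream hashTags as in Example 1 below; the 60-second window and 10-second slide are illustrative):

// Count hash-tags over the last 60 seconds, recomputed every 10 seconds.
val tagCounts = hashTags.window(Seconds(60), Seconds(10)).countByValue()
tagCounts.print()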

SLIDES 80-82

Example 1

◮ Get hash-tags from Twitter.

val ssc = new StreamingContext("local[2]", "test", Seconds(1))
val tweets = ssc.twitterStream(<username>, <password>)

val hashTags = tweets.flatMap(status => getTags(status))

hashTags.saveAsHadoopFiles("hdfs://...")

SLIDE 83

Example 2

◮ Count frequency of words received in the last minute.

val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream(args(1), args(2).toInt)

val words = lines.flatMap(_.split(" "))
val ones = words.map(x => (x, 1))
val freqs_60s = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))

freqs_60s.print()       // emit the windowed counts each batch
ssc.start()             // the computation only starts here
ssc.awaitTermination()

SLIDES 84-85

Summary

◮ How to store and process big data? Scale up vs. scale out.
◮ Cluster programming model: dataflow.
◮ Spark: RDD (transformations and actions).
◮ Spark SQL: DataFrame (RDD + schema).
◮ Spark Streaming: DStream (sequence of RDDs).

SLIDE 86

Questions?
