CS555: Distributed Systems [Fall 2019]

  • Dept. Of Computer Science, Colorado State University

CS 555: DISTRIBUTED SYSTEMS

[SPARK]

Shrideep Pallickara Computer Science Colorado State University

October 10, 2019

Frequently asked questions from the previous class survey

Topics covered in this lecture

• Transformations and Actions
  ◦ RDDs
  ◦ DataFrames

COMMON TRANSFORMATIONS AND ACTIONS

Element-wise transformations: filter()

• Takes in a function and returns an RDD that only has elements that pass the filter() function

Element-wise transformations: map()

• Takes in a function and applies it to each element in the RDD
• Result of the function is the new value of each element in the resulting RDD

[Figure: inputRDD {1, 2, 3, 4}; map x => x*x yields Mapped RDD {1, 4, 9, 16}; filter x => x != 1 yields Filtered RDD {2, 3, 4}]

Things that can be done with map()

• Anything from fetching the website associated with each URL in a collection to just squaring numbers
• map()'s return type does not have to be the same as its input type
• Multiple output elements for each input element?
  ◦ Use flatMap()

    lines = sc.parallelize(["hello world", "hi"])
    words = lines.flatMap(lambda line: line.split(" "))
    words.first()  # returns "hello"

Difference between map and flatMap

RDD1 {"coffee panda", "happy panda", "happiest panda party"}

RDD1.map(tokenize)
    mappedRDD {["coffee", "panda"], ["happy", "panda"], ["happiest", "panda", "party"]}

RDD1.flatMap(tokenize)
    flatMappedRDD {"coffee", "panda", "happy", "panda", "happiest", "panda", "party"}
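
The difference between map() and flatMap() can be reproduced on an ordinary Python list. The sketch below is a local model of the semantics only, not Spark code; `tokenize` is assumed to split on whitespace, which the slide does not specify.

```python
from itertools import chain

def tokenize(line):
    """Split a line into words on whitespace (assumed behaviour of the slide's tokenize)."""
    return line.split()

rdd1 = ["coffee panda", "happy panda", "happiest panda party"]

# map(): exactly one output element per input element (here, a list of words).
mapped = [tokenize(line) for line in rdd1]

# flatMap(): the per-element results are flattened into a single sequence.
flat_mapped = list(chain.from_iterable(tokenize(line) for line in rdd1))

print(mapped)
print(flat_mapped)
```

The mapped result keeps three elements (one list per line), while the flat-mapped result holds the seven individual words.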

Pseudo set operations

• RDDs support many of the operations of mathematical sets, such as union and intersection
  ◦ Even when the RDDs themselves are not properly sets

Some simple set operations

RDD1 {coffee, coffee, panda, monkey, tea}
RDD2 {coffee, monkey, kitty}

RDD1.distinct()            {coffee, monkey, panda, tea}
RDD1.union(RDD2)           {coffee, coffee, coffee, panda, monkey, monkey, tea, kitty}
RDD1.intersection(RDD2)    {coffee, monkey}
RDD1.subtract(RDD2)        {panda, tea}
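
A plain-Python model of these four operations makes the duplicate handling explicit. This is a sketch of the semantics on local lists, not Spark code; the helper names are hypothetical, and results are sorted only for readability since RDD element order is not meaningful here.

```python
rdd1 = ["coffee", "coffee", "panda", "monkey", "tea"]
rdd2 = ["coffee", "monkey", "kitty"]

def distinct(xs):
    """Drop duplicates."""
    return sorted(set(xs))

def union(xs, ys):
    """Concatenate the two collections; duplicates are kept."""
    return xs + ys

def intersection(xs, ys):
    """Elements present in both inputs, with duplicates removed."""
    return sorted(set(xs) & set(ys))

def subtract(xs, ys):
    """Elements of xs whose value does not occur anywhere in ys."""
    other = set(ys)
    return [x for x in xs if x not in other]

print(distinct(rdd1))
print(union(rdd1, rdd2))
print(intersection(rdd1, rdd2))
print(subtract(rdd1, rdd2))
```

Note that union() is the only one of the four that can be computed without comparing elements across the two inputs.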

Cartesian product between two RDDs

RDD1 {User1, User2, User3}
RDD2 {Venue("Betabrand"), Venue("Asha Tree House"), Venue("Ritual")}

RDD1.cartesian(RDD2)
    { (User1, Venue("Betabrand")), (User1, Venue("Asha Tree House")), (User1, Venue("Ritual")),
      (User2, Venue("Betabrand")), (User2, Venue("Asha Tree House")), (User2, Venue("Ritual")),
      (User3, Venue("Betabrand")), (User3, Venue("Asha Tree House")), (User3, Venue("Ritual")) }

COMMON ACTIONS

Actions on Basic RDDs

• reduce()
  ◦ Takes a function that operates on two elements in the RDD and returns an element of the same type
    ▪ An example of such an operation? + sums the RDD

    sum = rdd.reduce(lambda x, y: x + y)

• fold() takes a function with the same signature as reduce(), but also takes a "zero value" for the initial call
  ◦ The "zero value" should be the identity element for the operation
  ◦ E.g., 0 for +, 1 for *, empty list for concatenation
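
Why the zero value should be an identity element becomes visible once the data is split into partitions. The sketch below is a simplified local model, not Spark code: it assumes the zero value seeds the fold of every partition and is used once more when the partial results are merged. The two-partition layout and the helper names are hypothetical.

```python
from functools import reduce
from operator import add, mul

# Hypothetical layout: the RDD {1, 2, 3, 3} split across two partitions.
partitions = [[1, 2], [3, 3]]

def rdd_reduce(parts, op):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(op, p) for p in parts if p]
    return reduce(op, partials)

def rdd_fold(parts, zero, op):
    """Fold each partition starting from `zero`, then fold the partials from `zero`."""
    partials = [reduce(op, p, zero) for p in parts]
    return reduce(op, partials, zero)

total = rdd_reduce(partitions, add)
folded_sum = rdd_fold(partitions, 0, add)
folded_product = rdd_fold(partitions, 1, mul)
# A non-identity zero value is counted once per partition plus once at the merge.
skewed = rdd_fold(partitions, 1, add)

print(total, folded_sum, folded_product, skewed)
```

With an identity zero value the result does not depend on how many partitions there are; with any other value it does.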

Both fold() and reduce() require return type to be of the same type as the RDD elements

• aggregate() removes that constraint
  ◦ E.g., when computing a running average, maintain both the running total and the number of elements

EXAMPLES: BASIC ACTIONS ON RDDS

Examples: Basic actions on RDDs [1/7]

• Our RDD contains {1, 2, 3, 3}
• collect()
  ◦ Return all elements from the RDD
  ◦ Invocation: rdd.collect()
  ◦ Result: {1, 2, 3, 3}

Examples: Basic actions on RDDs [2/7]

• Our RDD contains {1, 2, 3, 3}
• count()
  ◦ Number of elements in the RDD
  ◦ Invocation: rdd.count()
  ◦ Result: 4

Examples: Basic actions on RDDs [3/7]

• Our RDD contains {1, 2, 3, 3}
• countByValue()
  ◦ Number of times each element occurs in the RDD
  ◦ Invocation: rdd.countByValue()
  ◦ Result: { (1, 1), (2, 1), (3, 2) }

Examples: Basic actions on RDDs [4/7]

• Our RDD contains {1, 2, 3, 3}
• take(num)
  ◦ Return num elements from the RDD
  ◦ Invocation: rdd.take(2)
  ◦ Result: {1, 2}

Examples: Basic actions on RDDs [5/7]

• Our RDD contains {1, 2, 3, 3}
• reduce(func)
  ◦ Combine the elements of the RDD together in parallel
  ◦ Invocation: rdd.reduce((x, y) => x + y)
  ◦ Result: 9

Examples: Basic actions on RDDs [6/7]

• Our RDD contains {1, 2, 3, 3}
• aggregate(zeroValue)(seqOp, combOp)
  ◦ Similar to reduce() but used to return a different type
  ◦ Invocation:

        rdd.aggregate((0, 0))(
          (x, y) => (x._1 + y, x._2 + 1),
          (x, y) => (x._1 + y._1, x._2 + y._2))

  ◦ Result: (9, 4)
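
The roles of seqOp and combOp can be traced with a local model. This is a sketch in plain Python, not Spark code: seqOp folds elements into a per-partition accumulator of a different type, and combOp merges accumulators. The two-partition layout and the function names are assumptions made for illustration.

```python
from functools import reduce

# Hypothetical layout: the RDD {1, 2, 3, 3} split across two partitions.
partitions = [[1, 2], [3, 3]]
zero = (0, 0)  # (running total, number of elements)

def seq_op(acc, value):
    """Fold one element into a (total, count) accumulator."""
    return (acc[0] + value, acc[1] + 1)

def comb_op(a, b):
    """Merge two (total, count) accumulators."""
    return (a[0] + b[0], a[1] + b[1])

def rdd_aggregate(parts, zero_value, seq, comb):
    """Apply seq within each partition, then comb across the partial results."""
    partials = [reduce(seq, p, zero_value) for p in parts]
    return reduce(comb, partials, zero_value)

total, count = rdd_aggregate(partitions, zero, seq_op, comb_op)
average = total / count
print((total, count), average)
```

The accumulator is a pair while the elements are integers, which is exactly the case that reduce() and fold() cannot express.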

Examples: Basic actions on RDDs [7/7]

• Our RDD contains {1, 2, 3, 3}
• foreach(func)
  ◦ Apply the provided function to each element of the RDD
  ◦ Invocation: rdd.foreach(func)
  ◦ Result: Nothing

PERSISTENCE (CACHING)

Why persistence?

• Spark RDDs are lazily evaluated, and we may sometimes wish to use the same RDD multiple times
  ◦ Naively, Spark will recompute the RDD and all of its dependencies each time we call an action on the RDD
    ▪ Super expensive for iterative algorithms
• To avoid recomputing an RDD multiple times?
  ◦ Ask Spark to persist the data
  ◦ The nodes that compute the RDD store the partitions

Coping with failures

• If a node that has data persisted on it fails?
  ◦ Spark recomputes lost partitions of data when needed
• Also, replicate data on multiple nodes
  ◦ To handle node failures without slowdowns

Persistence Levels for Spark

Level                 Space used   CPU time   In memory   On disk   Comments
MEMORY_ONLY           High         Low        Y           N
MEMORY_ONLY_SER       Low          High       Y           N
MEMORY_AND_DISK       High         Medium     Some        Some      Spills to disk if there is too much data to fit in memory
MEMORY_AND_DISK_SER   Low          High       Some        Some      Spills to disk if there is too much data to fit in memory. Stores serialized representation in memory
DISK_ONLY             Low          High       N           Y

What if you attempt to cache too much data to fit in memory?

• Spark will evict old partitions using a Least Recently Used (LRU) cache policy
  ◦ For memory-only storage levels, evicted partitions will be recomputed the next time they are accessed
  ◦ For memory-and-disk ones? Write them out to disk
• RDDs also come with a method, unpersist()
  ◦ Manually remove data elements from the cache
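
The LRU policy itself is easy to model locally. The class below is a hypothetical, minimal sketch of least-recently-used eviction over cached partitions; it is not how Spark's memory manager is implemented, and capacity is counted in partitions rather than bytes purely for simplicity.

```python
from collections import OrderedDict

class LruPartitionCache:
    """Toy LRU cache keyed by partition id (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of cached partitions
        self._store = OrderedDict()   # ordered from least to most recently used
        self.evicted = []             # ids evicted so far, oldest first

    def get(self, key):
        """Return the cached partition and mark it as recently used, or None on a miss."""
        if key not in self._store:
            return None
        self._store.move_to_end(key)
        return self._store[key]

    def put(self, key, value):
        """Cache a partition, evicting least recently used entries if over capacity."""
        self._store[key] = value
        self._store.move_to_end(key)
        while len(self._store) > self.capacity:
            old_key, _ = self._store.popitem(last=False)
            self.evicted.append(old_key)

    def unpersist(self, key):
        """Manually drop a partition from the cache."""
        self._store.pop(key, None)

    def keys(self):
        return list(self._store)

cache = LruPartitionCache(capacity=2)
cache.put("p0", [1, 2])
cache.put("p1", [3, 3])
cache.get("p0")        # touch p0, so p1 becomes the least recently used entry
cache.put("p2", [4])   # over capacity: p1 is evicted
print(cache.keys(), cache.evicted)
```

A miss on an evicted partition (get returning None here) corresponds to the recompute-on-access case described for memory-only storage.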

WORKING WITH KEY/VALUE PAIRS

RDDs of key/value pairs

• Key/value RDDs are commonly used to perform aggregations
  ◦ Might have to do ETL (Extract, Transform, and Load) to get data into key/value formats
• Advanced feature to control layout of pair RDDs across nodes
  ◦ Partitioning

RDDs containing key/value pairs

• Are called pair RDDs
• Useful building block in many programs
  ◦ Expose operations that allow actions on each key in parallel or regroup data across the network
  ◦ reduceByKey() to aggregate data separately for each key
  ◦ join() to merge two RDDs together by grouping elements of the same key

Pair RDDs

• RDDs that contain key/value pairs
• Expose operations that allow you to act on each key in parallel or regroup data across the network

Creating Pair RDDs

• pairs = lines.map(lambda x: (x.split(" ")[0], x))
  ◦ Creates a pair RDD using the first word as the key
• Java does not have a built-in tuple type
  ◦ Use the scala.Tuple2 class
    ▪ new Tuple2(elem1, elem2)

TRANSFORMATIONS ON PAIR RDDS

Transformations on Pair RDDs [1/5]

• Pair RDD = {(1,2), (3,4), (3,6)}
• reduceByKey(func)
  ◦ Combine values with the same key
  ◦ Invocation: rdd.reduceByKey((x, y) => x + y)
  ◦ Result: { (1, 2), (3, 10) }

Transformations on Pair RDDs [2/5]

• Pair RDD = {(1,2), (3,4), (3,6)}
• groupByKey()
  ◦ Group values with the same key
  ◦ Invocation: rdd.groupByKey()
  ◦ Result: { (1, [2]), (3, [4, 6]) }

Transformations on Pair RDDs [3/5]

• Pair RDD = {(1,2), (3,4), (3,6)}
• mapValues(func)
  ◦ Apply function to each value of a pair RDD without changing the key
  ◦ Invocation: rdd.mapValues(x => x + 1)
  ◦ Result: { (1, 3), (3, 5), (3, 7) }

Transformations on Pair RDDs [4/5]

• Pair RDD = {(1,2), (3,4), (3,6)}
• values()
  ◦ Return an RDD of just the values
  ◦ Invocation: rdd.values()
  ◦ Result: { 2, 4, 6 }

Transformations on Pair RDDs [5/5]

• Pair RDD = {(1,2), (3,4), (3,6)}
• sortByKey()
  ◦ Return an RDD sorted by the key
  ◦ Invocation: rdd.sortByKey()
  ◦ Result: { (1, 2), (3, 4), (3, 6) }
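
The single-RDD pair transformations can be checked against the same example data with a local model. The functions below are hypothetical plain-Python stand-ins that mirror the semantics on a list of tuples; they are not Spark code, and outputs are sorted by key only to make them easy to compare.

```python
from collections import defaultdict

pairs = [(1, 2), (3, 4), (3, 6)]

def reduce_by_key(kvs, op):
    """Combine all values that share a key using op."""
    acc = {}
    for k, v in kvs:
        acc[k] = op(acc[k], v) if k in acc else v
    return sorted(acc.items())

def group_by_key(kvs):
    """Collect all values that share a key into a list."""
    groups = defaultdict(list)
    for k, v in kvs:
        groups[k].append(v)
    return sorted(groups.items())

def map_values(kvs, f):
    """Apply f to each value; keys are left untouched."""
    return [(k, f(v)) for k, v in kvs]

def values(kvs):
    """Keep only the values."""
    return [v for _, v in kvs]

def sort_by_key(kvs):
    """Order the pairs by key."""
    return sorted(kvs, key=lambda kv: kv[0])

print(reduce_by_key(pairs, lambda x, y: x + y))
print(group_by_key(pairs))
print(map_values(pairs, lambda x: x + 1))
print(values(pairs))
print(sort_by_key(pairs))
```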

TRANSFORMATIONS ON TWO PAIR RDDS

Transformations on two Pair RDDs [1/5]

• rdd = {(1,2), (3,4), (3,6)}, other = {(3,9)}
• subtractByKey()
  ◦ Remove elements with a key present in the other RDD
  ◦ Invocation: rdd.subtractByKey(other)
  ◦ Result: { (1, 2) }

Transformations on two Pair RDDs [2/5]

• rdd = {(1,2), (3,4), (3,6)}, other = {(3,9)}
• join()
  ◦ Perform an inner join between two RDDs. Only keys that are present in both pair RDDs are output
  ◦ Invocation: rdd.join(other)
  ◦ Result: { (3, (4, 9)), (3, (6, 9)) }

Transformations on two Pair RDDs [3/5]

• rdd = {(1,2), (3,4), (3,6)}, other = {(3,9)}
• leftOuterJoin()
  ◦ Perform a join between two RDDs where the key must be present in the first RDD
  ◦ Value associated with each key is a tuple of the value from the source and an Option for the value from the other pair RDD
    ▪ In Python, if a value is not present, None is used
  ◦ Invocation: rdd.leftOuterJoin(other)
  ◦ Result: { (1, (2, None)), (3, (4, 9)), (3, (6, 9)) }

Transformations on two Pair RDDs [4/5]

• rdd = {(1,2), (3,4), (3,6)}, other = {(3,9)}
• rightOuterJoin()
  ◦ Perform a join between two RDDs where the key must be present in the other RDD
  ◦ Tuple has an Option for the source rather than the other RDD
  ◦ Invocation: rdd.rightOuterJoin(other)
  ◦ Result: { (3, (4, 9)), (3, (6, 9)) }

Transformations on two Pair RDDs [5/5]

• rdd = {(1,2), (3,4), (3,6)}, other = {(3,9)}
• cogroup()
  ◦ Group data from both RDDs using the same key
  ◦ Invocation: rdd.cogroup(other)
  ◦ Result: { (1, ([2], [])), (3, ([4, 6], [9])) }
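
The five two-RDD transformations differ only in which keys survive and how missing sides are represented. The sketch below models them on local lists of tuples, following the Python convention of None for a missing value. It is not Spark code, and all helper names are hypothetical.

```python
from collections import defaultdict

rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]

def _index(kvs):
    """Map each key to the list of its values."""
    idx = defaultdict(list)
    for k, v in kvs:
        idx[k].append(v)
    return idx

def subtract_by_key(left, right):
    """Keep pairs of left whose key does not occur in right."""
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]

def join(left, right):
    """Inner join: only keys present on both sides are output."""
    r = _index(right)
    return [(k, (v, w)) for k, v in left for w in r.get(k, [])]

def left_outer_join(left, right):
    """Every left pair is output; a missing right value becomes None."""
    r = _index(right)
    return [(k, (v, w)) for k, v in left for w in r.get(k, [None])]

def right_outer_join(left, right):
    """Every right pair is output; a missing left value becomes None."""
    l = _index(left)
    return [(k, (v, w)) for k, w in right for v in l.get(k, [None])]

def cogroup(left, right):
    """For each key on either side, pair up the value lists from both sides."""
    l, r = _index(left), _index(right)
    return [(k, (l.get(k, []), r.get(k, []))) for k in sorted(set(l) | set(r))]

print(subtract_by_key(rdd, other))
print(join(rdd, other))
print(left_outer_join(rdd, other))
print(right_outer_join(rdd, other))
print(cogroup(rdd, other))
```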

Example of chaining operations

Input pair RDD:

    key      value
    panda    0
    pink     3
    pirate   3
    panda    1
    pink     4

After mapValues:

    key      value
    panda    (0, 1)
    pink     (3, 1)
    pirate   (3, 1)
    panda    (1, 1)
    pink     (4, 1)

After reduceByKey:

    key      value
    panda    (1, 2)
    pink     (7, 2)
    pirate   (3, 1)

    rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
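
The chained mapValues and reduceByKey steps produce a (sum, count) pair per key, from which a per-key average follows with one more division. A plain-Python trace of the same data is sketched below; it is a local model, not Spark code, and the final averaging step is an addition that the slide leaves implicit.

```python
rdd = [("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)]

# mapValues(x => (x, 1)): attach a count of 1 to every value.
with_counts = [(k, (v, 1)) for k, v in rdd]

# reduceByKey: add the totals and the counts component-wise for each key.
sums = {}
for key, (total, count) in with_counts:
    prev_total, prev_count = sums.get(key, (0, 0))
    sums[key] = (prev_total + total, prev_count + count)

# Follow-up step (not on the slide): divide the total by the count for each key.
averages = {key: total / count for key, (total, count) in sums.items()}

print(sums)
print(averages)
```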

A word count example

• We use flatMap() to split each line into words, then map() to produce a pair RDD of words and the number 1

    rdd = sc.textFile("s3://…")
    words = rdd.flatMap(lambda x: x.split(" "))
    result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
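
The same three stages can be followed on a small in-memory input. The sketch below is a local model in plain Python, not Spark code; the sample lines are made up for illustration.

```python
from collections import Counter

# Hypothetical input standing in for the lines of the text file.
lines = ["hello world", "hi", "hello spark"]

# flatMap: one line yields many words.
words = [w for line in lines for w in line.split(" ")]

# map: each word becomes a (word, 1) pair.
pairs = [(w, 1) for w in words]

# reduceByKey: add up the 1s for each distinct word.
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(dict(counts))
```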

DATAFRAMES

Dataframe Review

• Maintains features of RDDs
  ◦ In-memory, resilient, distributed computing
  ◦ Supports the same transformations and actions
  ◦ APIs in a variety of languages
• Differs from RDDs by
  ◦ Maintenance of data schema
  ◦ Additional optimizations to the query plan (Catalyst rules)

Dataframe sources

• DataFrames may be initialized from a variety of sources
  ◦ Distributed file systems (CSV, JSON, XML, etc.)
  ◦ Databases (MySQL, Cassandra, Hive, Redis, etc.)
  ◦ RDDs
• Able to infer the schema of structured data (the "inferSchema" option)

Example: Dataframe Initialization

    val df1 = sqlContext.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs://albany:56781/data/main/1987.csv")

    val df2 = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .option("table", "students")
      .option("keyspace", "csu")
      .load()

COMMON DATAFRAME OPERATIONS

Column Manipulations [1/4]

• withColumn(columnName, func)
  ◦ Return a DataFrame with the additional column
  ◦ Invocation: df.withColumn("dogYears", df("age") / 7)
• drop(columnName)
  ◦ Return a DataFrame without the column
  ◦ Invocation: df.drop("age")

Column Manipulations [2/4]

• select(columnNames)
  ◦ Return a DataFrame with the specified columns
  ◦ Invocation: df.select("firstName", "age")
• describe(columnNames)
  ◦ Compute summary statistics over DataFrame columns
  ◦ Invocation: df.describe("age"), df.describe()

Column Manipulations [3/4]

    val df = Seq(
      ("Peterson", "Marcus", 54),
      ("Batey", "Edward", 36),
      ("Bruce", "Karen", 35)
    ).toDF("lastName", "firstName", "age")

    // withColumn returns a new DataFrame, so keep a reference to it
    val withDogYears = df.withColumn("dogYears", df("age") / 7.0)
    withDogYears.describe("age", "dogYears").show()

Column Manipulations [4/4]

+-------+---------+---------+
|summary|      age| dogYears|
+-------+---------+---------+
|  count|        3|        3|
|   mean|  41.6667|  5.95238|
| stddev| 10.69268|  1.52753|
|    min|       35|        5|
|    max|       54| 7.714286|
+-------+---------+---------+
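
The numbers in this table can be reproduced by hand. The sketch below recomputes them in plain Python for the three ages from the example. It is not Spark code, and it assumes the stddev row is the sample standard deviation (dividing by n - 1), which is what matches the values shown.

```python
import statistics

ages = [54, 36, 35]
dog_years = [age / 7.0 for age in ages]

def describe(values):
    """Summary statistics in the same order as the table rows."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }

age_summary = describe(ages)
dog_summary = describe(dog_years)

for name in ("count", "mean", "stddev", "min", "max"):
    print(name, age_summary[name], dog_summary[name])
```

Because dogYears is a constant multiple of age, its mean and standard deviation are simply the age statistics divided by 7.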

Dataframe joins

• join(other, <columnComparison>, <joinType>)
  ◦ Performs a join between two DataFrames
  ◦ Invocation: df1.join(df2, Seq("id"))

Join column comparison

• Supports a variety of criteria
  ◦ Sequence of column names (e.g., Seq("id", "age"))
  ◦ Elaborate comparison definitions (e.g., df1("age") >= df2("age"))

Join Type

• DataFrames may perform multiple styles of join
  ◦ Inner: typical dataset join with key-to-key match
  ◦ Outer, left-outer, right-outer: result also keeps unmatched rows (from both sides, the left side, or the right side, respectively), filling in columns with null values where data doesn't exist
  ◦ Left-semi: similar to an inner join, but the result only contains rows (and columns) from the left dataset that have a match in the right one (Spark has no right-semi join type; swap the operands instead)

Example: Spark SQL

    val df = Seq(
      ("Peterson", "Marcus", 54),
      ("Batey", "Edward", 36),
      ("Bruce", "Karen", 35)
    ).toDF("lastName", "firstName", "age")

    df.createOrReplaceTempView("people")
    spark.sql("SELECT firstName, age, age / 7.0 AS dogYears FROM people WHERE age < 50")

TUNING THE LEVEL OF PARALLELISM

Tuning the level of parallelism

• Every RDD has a fixed number of partitions
  ◦ Determines the degree of parallelism when executing operations
• During aggregations or grouping operations, you can ask Spark to use a specific number of partitions
  ◦ This will override defaults that Spark uses

Example: Tuning the level of parallelism

    data = [("a", 3), ("b", 4), ("a", 1)]
    sc.parallelize(data).reduceByKey(lambda x, y: x + y)      # default parallelism
    sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)  # custom parallelism

What if you want to tune parallelism outside of grouping and aggregation operations?

• There is repartition()
  ◦ Shuffles data across the network to create a new set of partitions
  ◦ Very expensive operation!
• There is the coalesce() operation
  ◦ Avoids data movement
    ▪ But only if you are decreasing the number of partitions
  ◦ Check rdd.getNumPartitions() and make sure you are coalescing to fewer partitions
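
The cost difference comes from whether existing partitions are kept whole or broken apart. The functions below are a hypothetical local model on lists of lists, not Spark's implementation: coalescing only merges whole partitions, while repartitioning reassigns every element. The grouping and round-robin rules used here are illustrative assumptions.

```python
def coalesce(parts, n):
    """Merge existing partitions into n groups without splitting any of them."""
    if n >= len(parts):
        # The partition count cannot be increased by merging; return a copy.
        return [list(p) for p in parts]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        merged[i * n // len(parts)].extend(part)
    return merged

def repartition(parts, n):
    """Reassign every element individually (a stand-in for a full shuffle)."""
    out = [[] for _ in range(n)]
    for i, x in enumerate(x for p in parts for x in p):
        out[i % n].append(x)
    return out

parts = [[1, 2], [3], [4, 5], [6]]
fewer = coalesce(parts, 2)
more = repartition(parts, 3)

print(len(parts), fewer, more)
```

In the coalesced result every original partition ends up intact inside one new partition, which is why no element-level data movement is needed.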

The contents of this slide-set are based on the following references

• Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Learning Spark: Lightning-Fast Big Data Analysis. 1st Edition. O'Reilly. 2015. ISBN-13: 978-1449358624. [Chapters 1-4]
• Holden Karau and Rachel Warren. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark. O'Reilly Media. 2017. ISBN-13: 978-1491943205. [Chapter 2]
• Bill Chambers and Matei Zaharia. Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media. 2018. ISBN-13: 978-1491912218. [Chapters 2, 3, 4, 15, and 16]