SLIDE 1

Iteratively Improving Spark Application Performance

William C. Benton, Red Hat, Inc.

SLIDE 2

Forecast

  • Background: Spark, RDDs, and Spark’s execution model
  • Case study overview
  • Improving our prototype
SLIDE 3

Background

SLIDE 4

Apache Spark

  • Introduced in 2009; donated to Apache in 2013; 1.0 release in 2014
  • Based on a fundamental abstraction: the resilient distributed dataset
  • Supports in-memory computing and a wide range of algorithms

SLIDE 11

Spark is general: the Spark core underlies Graph, SQL, ML, and Streaming libraries as well as ad hoc analysis, and runs on Mesos and YARN.

APIs for Scala, Java, Python, and R (third-party bindings for Clojure et al.)

SLIDE 12

Resilient distributed datasets

SLIDE 18

Resilient distributed datasets

Partitioned across machines by range… …or by hash

SLIDE 21

Resilient distributed datasets

Failures mean partitions can disappear… …but they can be reconstructed!

SLIDE 23

RDDs are partitioned, immutable, lazy collections

TRANSFORMATIONS create new RDDs that encode a dependency DAG. ACTIONS execute cluster jobs and return values to the driver.
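A minimal sketch of that split, assuming an existing SparkContext named sc:

  val nums = sc.parallelize(1 to 10000)   // create an RDD
  val evens = nums.filter(_ % 2 == 0)     // transformation: extends the DAG, runs nothing
  val doubled = evens.map(_ * 2)          // another lazy transformation
  val total = doubled.reduce(_ + _)       // action: schedules a cluster job, returns a value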

SLIDE 24

Creating RDDs

  • From a collection: parallelize()
  • …a local or remote file: textFile()
  • …or HDFS: hadoopFile(), sequenceFile(), objectFile()
  • (These all act lazily; see the sketch below)
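A hedged sketch of these entry points (the paths are illustrative):

  val fromColl = sc.parallelize(Seq(1, 2, 3))
  val fromText = sc.textFile("hdfs:///data/rides/20140909.tcx")      // no read happens yet
  val fromSeq  = sc.sequenceFile[String, Int]("hdfs:///data/counts") // typed key/value pairs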
SLIDE 25

RDD[T] transformations

  • map(f: T=>U): RDD[U]
  • flatMap(f: T=>Seq[U]): RDD[U]
  • filter(f: T=>Boolean): RDD[T]
  • distinct(): RDD[T]
  • keyBy(f: T=>K): RDD[(K, T)]
SLIDE 26

RDD[(K,V)] transformations

  • sortByKey(): RDD[(K,V)]
  • groupByKey(): RDD[(K,Seq[V])]
  • reduceByKey(f: (V,V)=>V): RDD[(K,V)]
  • join(other: RDD[(K,W)]): RDD[(K,(V,W))]

…and many others!
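A small illustrative sketch of the pair-RDD operations above (the data is made up):

  val sales = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))
  val totals = sales.reduceByKey(_ + _)          // ("a", 4.0), ("b", 2.0)
  val labels = sc.parallelize(Seq(("a", "Ames"), ("b", "Barre")))
  val joined = totals.join(labels)               // ("a", (4.0, "Ames")), ("b", (2.0, "Barre"))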

SLIDE 28

RDD[T] actions

  • collect(): Array[T]
  • count(): Long
  • reduce(f: (T,T)=>T): T
  • saveAsTextFile(path)
  • saveAsSequenceFile(path)

(Remember: all actions are eager) …and many others!
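A brief sketch of eager actions at work (the path and data are illustrative):

  val words = sc.parallelize(Seq("to", "be", "or", "not", "to", "be"))
  println(words.count())                     // 6: runs a job immediately
  val uniques = words.distinct().collect()   // e.g. Array("to", "be", "or", "not"); order varies
  words.saveAsTextFile("hdfs:///tmp/words")  // writes one part-file per partition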

SLIDE 30

Example: word count in Spark

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
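A usage note on this example: reduceByKey combines partial counts on each node before shuffling. A hedged contrast with the groupByKey alternative, which ships every (word, 1) pair across the network:

  val slowCounts = file.flatMap(_.split(" "))
                       .map((_, 1))
                       .groupByKey()
                       .mapValues(_.sum)   // same result, far more shuffle traffic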
SLIDE 36

Case study: bicycling data

SLIDE 37

Metrics available to cyclists

            direct                             derivative or lagging
  effort    cadence, torque                    power, heart rate, calories, training stress
  course    position, elevation, temperature   grade, speed, distance

(Typically, one sample per second)
SLIDE 40

“Where should I do intervals?”

SLIDE 41

  • Mt. Diablo (from North Gate), near Walnut Creek, CA: 10.5 miles, 405’ to 3,970’
  • “Buddha’s Palm,” near Cross Plains, WI: 16.8 miles, 890’ to 1,214’
SLIDE 42

Finding best efforts

A four-minute all-out effort includes sixty strong three-minute efforts.

TWIN VALLEY TIME TRIAL, Middleton, WI: 1.2 miles (0.8 miles at ~5.7%)
SLIDES 43-48

Start by finding CLUSTERS of SPATIAL points… …and then consider only the best effort between a pair of clusters!

[Map of the Middleton–Verona, WI area along US 12, with clusters marking the local criterium course, the highest point in southern Wisconsin, a climbing section from the Wisconsin Ironman, and the Twin Valley time trial.]
SLIDE 49

The prototype

SLIDES 50-52

Windowed processing

20140909.tcx → (20140909.tcx, 0), (20140909.tcx, 1), (20140909.tcx, 2), (20140909.tcx, 3), … (20140909.tcx, 14)

Keep & plot the best windows for each spatial cluster pair!

start: cluster 0, end: cluster 1 / start: cluster 1, end: cluster 2 / start: cluster 2, end: cluster 0
SLIDES 53-60

trait ActivitySliding {
  import org.apache.spark.rdd.RDD
  import com.freevariable.surlaplaque.data.{Trackpoint => TP}

  def windowsForActivities[U](data: RDD[TP], period: Int,
                              xform: (TP => U) = identity _) = {
    val pairs = data.groupBy((tp: TP) => tp.activity.getOrElse("UNKNOWN"))
    pairs.flatMap { case (activity: String, stp: Seq[TP]) =>
      (stp sliding period).zipWithIndex.map { case (s, i) => ((activity, i), s.map(xform)) }
    }
  }

  def identity(tp: Trackpoint) = tp
}

Transform an RDD of TRACKPOINTS… …to an RDD of WINDOW IDs and SAMPLE WINDOWS
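A hedged usage sketch, assuming an RDD of trackpoints named points:

  val thirtySecondWindows = windowsForActivities(points, 30)
  // => RDD[((String, Int), Seq[Trackpoint])]: (activity, offset) -> sample window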

SLIDES 61-64

def bestsForPeriod(data: RDD[TP], period: Int, app: SLP, model: KMeansModel) = {
  // divide the input data into overlapping windows, keyed by ACTIVITY and
  // OFFSET (we’ll call this key a WINDOW ID)
  val windowedSamples = windowsForActivities(data, period, minify _).cache
  // identify the spatial clusters that each window starts and ends in
  val clusterPairs = windowedSamples.map { case ((act, idx), samples) =>
    ((act, idx), clusterPairsForWindow(samples, model))
  }
  // identify the MEAN WATTAGE for each window of samples
  val mmps = windowedSamples.map { case ((act, idx), samples) =>
    ((act, idx), samples.map(_.watts).reduce(_ + _) / samples.size)
  }
  // continued...
SLIDES 65-73

def bestsForPeriod(data: RDD[TP], period: Int, app: SLP, model: KMeansModel) = {
  val windowedSamples = /* window IDs and raw sample windows */
  val clusterPairs = /* window IDs and spatial cluster pairs */
  val mmps = /* window IDs and mean wattages for each window */

  // for each window ID, JOIN its mean wattage with its spatial clusters
  val top20 = mmps.join(clusterPairs)
    // transpose these tuples so they are keyed by spatial cluster pairs
    .map { case ((act, idx), (watts, (cls1, cls2))) => ((cls1, cls2), (watts, (act, idx))) }
    // KEEP ONLY the BEST wattage for each spatial cluster pair
    .reduceByKey((a, b) => if (a._1 > b._1) a else b)
    // project away the cluster centers
    .map { case ((cls1, cls2), (watts, (act, idx))) => (watts, (act, idx)) }
    // SORT by wattage in descending order; keep the best twenty
    .sortByKey(false)
    .take(20)

  app.context.parallelize(top20)
    // re-key the best efforts by window ID
    .map { case (watts, (act, idx)) => ((act, idx), watts) }
    // get the actual sample windows for each effort; project away IDs
    .join(windowedSamples)
    .map { case ((act, idx), (watts, samples)) => (watts, samples) }
    .collect
}
SLIDE 74

Improving the prototype

SLIDE 75

Broadcast large static data

SLIDE 77

Broadcast variables

// phoneBook maps (given name, surname) -> phone number digits
val phoneBook: Map[(String, String), String] = initPhoneBook()
val names: RDD[(String, String)] = /* ... */
val directory = names.map { case name @ (first, last) =>
  (name, phoneBook.getOrElse(name, "555-1212"))
}

phoneBook will be copied and deserialized for each task!

SLIDE 78

Broadcast variables

// phoneBook maps (given name, surname) -> phone number digits
val phoneBook: Map[(String, String), String] = initPhoneBook()
val names: RDD[(String, String)] = /* ... */
val pbb = sparkContext.broadcast(phoneBook)
val directory = names.map { case name @ (first, last) =>
  (name, pbb.value.getOrElse(name, "555-1212"))
}

Broadcasting phoneBook means it can be deserialized once and cached on each node!
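A minimal lifecycle sketch for broadcast variables (the names are illustrative; unpersist is optional cleanup):

  val lookup = Map(1 -> "one", 2 -> "two")
  val bLookup = sc.broadcast(lookup)       // shipped once per executor, not once per task
  val named = sc.parallelize(Seq(1, 2, 2)).map(bLookup.value(_)).collect()
  bLookup.unpersist()                      // drop the cached copies on the executors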

SLIDE 79

def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP, model: KMeansModel) = {
  val windowedSamples = windowsForActivities(data, period, minify _)
  val clusterPairs = windowedSamples.map { case ((act, idx), samples) =>
    ((act, idx), clusterPairsForWindow(samples, model))
  }
  // ...
}

SLIDE 80

def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP,
                   model: Broadcast[KMeansModel]) = {
  val windowedSamples = windowsForActivities(data, period, minify _)
  val clusterPairs = windowedSamples.map { case ((act, idx), samples) =>
    ((act, idx), clusterPairsForWindow(samples, model.value))
  }
  // rest of function unchanged
}
SLIDE 81

Cache only when necessary

SLIDES 82-85

def bestsForPeriod(data: RDD[TP], period: Int, app: SLP, model: KMeansModel) = {
  val windowedSamples = windowsForActivities(data, period).cache
  // ...
  val top20 = mmps.join(clusterPairs)
    .map { case ((act, idx), (watts, (cls1, cls2))) => ((cls1, cls2), (watts, (act, idx))) }
    .reduceByKey((a, b) => if (a._1 > b._1) a else b)
    .map { case ((cls1, cls2), (watts, (act, idx))) => (watts, (act, idx)) }
    .sortByKey(false)
    .take(20)
  // ...
}

Keeping EVERY WINDOW IN MEMORY… …even though recomputing windows is incredibly CHEAP and you’ll need only a tiny fraction of them?

Eliminating unnecessary memory pressure can lead to a substantial speedup!
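A hedged rule-of-thumb sketch (the helper names here are hypothetical): cache an RDD only when it is reused across jobs and expensive to recompute, and release it when done.

  val windows = windowsForActivities(data, period)    // cheap to recompute: leave uncached
  val features = expensiveFeaturize(windows).cache()  // reused twice below: worth caching
  val model = train(features)
  val error = evaluate(model, features)
  features.unpersist()                                // free the memory once finished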

SLIDE 86

Avoid shuffles when possible

SLIDES 87-89

[Diagram: a job’s TASKS, grouped into STAGES; stage boundaries correspond to shuffles.]

We want to avoid all unnecessary shuffles.
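A rough sketch of which operations introduce shuffle (stage) boundaries, under common defaults:

  val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))  // narrow: same stage
  val scaled = pairs.mapValues(_ * 2)                         // narrow: no shuffle
  val sums = pairs.reduceByKey(_ + _)                         // wide: shuffles, new stage
  val joined = pairs.join(sums)                               // wide: generally shuffles both sides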

SLIDES 90-97

def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP,
                   model: Broadcast[KMeansModel]) = {
  val windowedSamples = windowsForActivities(data, period, minify _)
  val bests = windowedSamples.map { case ((act, idx), samples) =>
    ( clusterPairsForWindow(samples, model.value),                      // start and end clusters
      ((act, idx), samples.map(_.watts).reduce(_ + _) / samples.size) ) // window IDs and mean wattages
  }.cache
  // eliminate a join and a transpose
  val top20 = bests.reduceByKey((a, b) => if (a._2 > b._2) a else b)
    .map { case ((_, _), keep) => keep }
    // use the right API calls!
    .takeOrdered(20)(Ordering.by[((String, Int), Double), Double] { case ((_, _), watts) => -watts })
  // eliminate a transpose
  app.context.parallelize(top20)
    .join(windowedSamples)
    .map { case ((act, idx), (watts, samples)) => (watts, samples) }
    .collect
}

…or two, by collecting before the final projection:

app.context.parallelize(top20)
  .join(windowedSamples)
  .collect
  .map { case ((act, idx), (watts, samples)) => (watts, samples) }
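A small self-contained contrast behind “use the right API calls”:

  val xs = sc.parallelize(Seq(5, 1, 9, 3))
  val viaSort = xs.map(x => (x, ())).sortByKey(false).take(2)  // full shuffle-backed sort
  val viaTake = xs.takeOrdered(2)(Ordering[Int].reverse)       // bounded per-partition heaps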

SLIDE 99

Embrace laziness

(only pay for what you use)

SLIDES 100-102

Windowed processing redux

20140909.tcx → (20140909.tcx, 0), (20140909.tcx, 1), (20140909.tcx, 2), (20140909.tcx, 3), … (20140909.tcx, 14)

start: cluster 0, end: cluster 1 / start: cluster 1, end: cluster 2 / start: cluster 2, end: cluster 0
SLIDES 103-107

Lazy windowed processing

20140909.tcx → (20140909.tcx, 0), (20140909.tcx, 1), … (20140909.tcx, 14)

Each window becomes a lightweight summary (watts; begin and end timestamps, e.g. 0:00 to 0:29; activity file 20140909.tcx), keyed by its start and end clusters (start: cluster 0, end: cluster 1; …). Only the best summary per cluster pair is kept; raw samples are never materialized for the rest.
SLIDES 108-109

trait ActivitySliding {
  import org.apache.spark.rdd.RDD
  import com.freevariable.surlaplaque.data.{Trackpoint => TP}
  // ...
  def applyWindowedNoZip[U: ClassTag](data: RDD[TP], period: Int,
                                      xform: ((String, Seq[TP]) => U)) = {
    val pairs = data.groupBy((tp: TP) => tp.activity.getOrElse("UNKNOWN"))
    pairs.flatMap { case (activity: String, stp: Seq[TP]) =>
      (stp sliding period).map { xform(activity, _) }
    }
  }
}

Perform arbitrary TRANSFORMATIONS on each window of samples

SLIDE 110

case class Effort(mmp: Double, activity: String,
                  startTimestamp: Long, endTimestamp: Long) {}

Model EFFORT SUMMARIES as simple, lightweight records

SLIDES 111-114

def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP,
                   model: Broadcast[KMeansModel]) = {
  val clusteredMMPs = applyWindowedNoZip(data, period, {
    case (activity: String, samples: Seq[Trackpoint]) =>
      ( clusterPairsForWindow(samples, model.value),
        Effort(samples.map(_.watts).reduce(_ + _) / samples.size,
               activity, samples.head.timestamp, samples.last.timestamp) )
  })
  // continued
}
SLIDES 115-121

def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP,
                   model: Broadcast[KMeansModel]) = {
  val clusteredMMPs = /* map from cluster pairs to effort summary structures */
  clusteredMMPs
    .reduceByKey((a, b) => if (a.mmp > b.mmp) a else b)
    .takeOrdered(20)(Ordering.by[((Int, Int), Effort), Double] { case (_, e: Effort) => -e.mmp })
    .map { case (_, e: Effort) =>
      ( e.mmp,
        data.filter { case tp: Trackpoint =>
          tp.activity.getOrElse("UNKNOWN") == e.activity &&
          tp.timestamp <= e.endTimestamp && tp.timestamp >= e.startTimestamp
        }.collect ) }
}
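Because clusteredMMPs carries only small Effort records, the heavy sample windows are never shuffled or cached; the raw trackpoints are filtered out of data only for the twenty winning efforts, after the best effort per cluster pair has been chosen.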
SLIDE 122

Conclusions

SLIDE 127

Broadcast large static data: 1.1x
Cache only when necessary: 2x
Avoid shuffles when possible: 2x
Embrace laziness (only pay for what you use): 14x

Work WITH Spark, not against it

SLIDE 128

Thanks!

willb@redhat.com http://chapeau.freevariable.com @willb