SLIDE 1 William C. Benton Red Hat, Inc.
Iteratively Improving Spark Application Performance
SLIDE 2 Forecast
- Background: Spark, RDDs, and
Spark’s execution model
- Case study overview
- Improving our prototype
SLIDE 3
Background
SLIDE 4 Apache Spark
- Introduced in 2009; donated to
Apache in 2013; 1.0 release in 2014
- Based on a fundamental abstraction:
the resilient distributed dataset
- Supports in-memory computing and
a wide range of algorithms
SLIDE 5–11 Spark is general: the Spark core is extended by libraries for Graph, SQL, ML, and Streaming workloads, and clusters can run ad hoc or under Mesos or YARN.
APIs for Scala, Java, Python, and R (third-party bindings for Clojure et al.)
SLIDE 12–18 Resilient distributed datasets
Partitioned across machines by range… …or BY HASH
SLIDE 19–21 Resilient distributed datasets
Failures mean partitions can disappear… …but they can be reconstructed!
SLIDE 22–23 RDDs are partitioned, immutable, lazy collections
TRANSFORMATIONS create new RDDs that encode a dependency DAG; ACTIONS execute cluster jobs and return values to the driver
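A minimal sketch of the distinction (assuming a live SparkContext named sc):

  // Transformations are lazy: these lines record dependencies but run no job.
  val numbers = sc.parallelize(1 to 1000000)   // RDD[Int]
  val evens   = numbers.filter(_ % 2 == 0)     // RDD[Int], still not computed
  // Actions are eager: count() schedules and runs a cluster job.
  val howMany = evens.count()                  // returns 500000 to the driver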
SLIDE 24 Creating RDDs
- From a collection: parallelize()
- …a local or remote file: textFile()
- …or HDFS: hadoopFile();
sequenceFile(); objectFile()
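For example (a sketch; assuming a SparkContext named sc, with illustrative paths):

  val fromCollection = sc.parallelize(List(1, 2, 3))           // RDD[Int]
  val fromText       = sc.textFile("hdfs:///data/points.txt")  // RDD[String], one element per line
  val fromSeq        = sc.sequenceFile[String, Int]("hdfs:///data/counts")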
SLIDE 25 RDD[T] transformations
- map(f: T=>U): RDD[U]
- flatMap(f: T=>Seq[U]): RDD[U]
- filter(f: T=>Boolean): RDD[T]
- distinct(): RDD[T]
- keyBy(f: T=>K): RDD[(K, T)]
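A quick sketch of these in use (illustrative data):

  val words   = sc.parallelize(Seq("spark", "rdd", "spark"))
  val lengths = words.map(_.length)     // RDD[Int]
  val unique  = words.distinct()        // RDD[String] with duplicates removed
  val keyed   = words.keyBy(_.head)     // RDD[(Char, String)]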
SLIDE 26–27 RDD[(K,V)] transformations
- sortByKey(): RDD[(K,V)]
- groupByKey(): RDD[(K,Seq[V])]
- reduceByKey(f: (V,V)=>V): RDD[(K,V)]
- join(other: RDD[(K,W)]): RDD[(K,(V,W))]
…and many others!
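A small sketch contrasting two of these (reduceByKey combines values within each partition before shuffling, so it typically moves far less data than groupByKey):

  val sales   = sc.parallelize(Seq(("WI", 3), ("CA", 5), ("WI", 2)))
  val grouped = sales.groupByKey()        // RDD[(String, Iterable[Int])]
  val totals  = sales.reduceByKey(_ + _)  // RDD[(String, Int)]: ("WI", 5), ("CA", 5)
  val joined  = totals.join(grouped)      // RDD[(String, (Int, Iterable[Int]))]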
SLIDE 28–29 RDD[T] actions
- collect(): Array[T]
- count(): Long
- reduce(f: (T,T)=>T): T
- saveAsTextFile(path)
- saveAsSequenceFile(path)
(Remember: ALL actions are eager) …and many others!
SLIDE 30–35 Example: word count in Spark

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
SLIDE 36
Case study: bicycling data
SLIDE 37–39 Metrics available to cyclists
[figure: metrics classified as direct vs. derivative or lagging, covering effort (cadence, torque, power, heart rate, calories, training stress) and course (position, elevation, temperature, grade, speed, distance)]
(Typically, one sample per second)
SLIDE 40
“Where should I do intervals?”
SLIDE 41 [elevation profiles]
- Mt. Diablo (from North Gate), near Walnut Creek, CA: 10.5 miles, 405’ to 3,970’
- “BUDDHA’S PALM,” near Cross Plains, WI: 16.8 miles, 890’ to 1,214’
SLIDE 42 Finding best efforts
A four-minute all-out effort includes sixty strong three-minute efforts. (At one sample per second, a 240-second window contains 61 overlapping 180-second windows.)
[elevation profile: TWIN VALLEY TIME TRIAL, Middleton, WI; 913’, 956’, 1,156’ markers; 1.2 miles (0.8 miles at ~5.7%)]
SLIDE 43–44 [map: Middleton, Waunakee, and Verona, WI, along US 12]
Start by finding CLUSTERS… …and then consider only the best effort between a pair of clusters!
SLIDE 45–48 [map: Middleton, Waunakee, and Verona, WI, along US 12 and Verona Rd, with callouts]
- Local Criterium Course
- Highest Point in southern Wisconsin
- CLIMBING section from WI IRONMAN
- TWIN VALLEY
SLIDE 49
The prototype
SLIDE 50–52 Windowed processing
20140909.tcx is split into overlapping windows: (20140909.tcx, 0), (20140909.tcx, 1), (20140909.tcx, 2), (20140909.tcx, 3), … (20140909.tcx, 14)
Each window maps to a start/end cluster pair (start: cluster 0, end: cluster 1; start: cluster 1, end: cluster 2; start: cluster 2, end: cluster 0)
Keep & PLOT the best windows for each spatial cluster pair!
SLIDE 53–60
trait ActivitySliding {
  import org.apache.spark.rdd.RDD
  import com.freevariable.surlaplaque.data.{Trackpoint => TP}

  def windowsForActivities[U](data: RDD[TP], period: Int, xform: (TP => U) = identity _) = {
    val pairs = data.groupBy((tp: TP) => tp.activity.getOrElse("UNKNOWN"))
    pairs.flatMap { case (activity: String, stp: Seq[TP]) =>
      (stp sliding period).zipWithIndex.map { case (s, i) => ((activity, i), s.map(xform)) }
    }
  }

  def identity(tp: Trackpoint) = tp
}

Transform an RDD of TRACKPOINTS… …to an RDD of WINDOW IDs and SAMPLE WINDOWS
SLIDE 61–64
def bestsForPeriod(data: RDD[TP], period: Int, app: SLP, model: KMeansModel) = {
  val windowedSamples = windowsForActivities(data, period, minify _).cache
  val clusterPairs = windowedSamples.map { case ((act, idx), samples) =>
    ((act, idx), clusterPairsForWindow(samples, model)) }
  val mmps = windowedSamples.map { case ((act, idx), samples) =>
    ((act, idx), samples.map(_.watts).reduce(_ + _) / samples.size) }
  // continued...

- windowedSamples divides the input data into overlapping windows, keyed by ACTIVITY and OFFSET (we’ll call this key a WINDOW ID)
- clusterPairs identifies the spatial clusters that each window starts and ends in
- mmps identifies the MEAN WATTAGE for each window of samples
SLIDE 65–73
def bestsForPeriod(data: RDD[TP], period: Int, app: SLP, model: KMeansModel) = {
  val windowedSamples = /* window IDs and raw sample windows */
  val clusterPairs = /* window IDs and spatial cluster pairs */
  val mmps = /* window IDs and mean wattages for each window */

  val top20 = mmps.join(clusterPairs)
    .map { case ((act, idx), (watts, (cls1, cls2))) => ((cls1, cls2), (watts, (act, idx))) }
    .reduceByKey((a, b) => if (a._1 > b._1) a else b)
    .map { case ((cls1, cls2), (watts, (act, idx))) => (watts, (act, idx)) }
    .sortByKey(false)
    .take(20)

  app.context.parallelize(top20)
    .map { case (watts, (act, idx)) => ((act, idx), watts) }
    .join(windowedSamples)
    .map { case ((act, idx), (watts, samples)) => (watts, samples) }
    .collect
}

- mmps.join(clusterPairs): for each window ID, JOIN its mean wattages with its spatial clusters
- the first map: transpose these tuples so they are keyed by spatial cluster pairs
- reduceByKey: KEEP ONLY the BEST wattage for each spatial cluster pair
- the second map: project away the cluster centers
- sortByKey(false).take(20): SORT by wattage in descending order; keep the best twenty
- parallelize and re-map: re-key the best efforts by window ID
- join(windowedSamples) and the final map: get the actual sample windows for each effort; project away IDs
- collect: return the results to the driver
SLIDE 74
Improving the prototype
SLIDE 75
Broadcast large static data
SLIDE 76–77 Broadcast variables

// phoneBook maps (given name, surname) -> phone number digits
val phoneBook: Map[(String, String), String] = initPhoneBook()
val names: RDD[(String, String)] = /* ... */
val directory = names.map { case name @ (first, last) =>
  (name, phoneBook.getOrElse(name, "555-1212")) }

phoneBook will be copied and deserialized for each task!
SLIDE 78 Broadcast variables

// phoneBook maps (given name, surname) -> phone number digits
val phoneBook: Map[(String, String), String] = initPhoneBook()
val names: RDD[(String, String)] = /* ... */
val pbb = sparkContext.broadcast(phoneBook)
val directory = names.map { case name @ (first, last) =>
  (name, pbb.value.getOrElse(name, "555-1212")) }

Broadcasting phoneBook means it can be deserialized once and cached on each node!
SLIDE 79
def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP, model: KMeansModel) = {
  val windowedSamples = windowsForActivities(data, period, minify _)
  val clusterPairs = windowedSamples.map { case ((act, idx), samples) =>
    ((act, idx), clusterPairsForWindow(samples, model)) }
  // ...
}
SLIDE 80
def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP,
                   model: Broadcast[KMeansModel]) = {
  val windowedSamples = windowsForActivities(data, period, minify _)
  val clusterPairs = windowedSamples.map { case ((act, idx), samples) =>
    ((act, idx), clusterPairsForWindow(samples, model.value)) }
  // rest of function unchanged
}
SLIDE 81
Cache only when necessary
SLIDE 82–85
def bestsForPeriod(data: RDD[TP], period: Int, app: SLP, model: KMeansModel) = {
  val windowedSamples = windowsForActivities(data, period).cache
  // ...
  val top20 = mmps.join(clusterPairs)
    .map { case ((act, idx), (watts, (cls1, cls2))) => ((cls1, cls2), (watts, (act, idx))) }
    .reduceByKey((a, b) => if (a._1 > b._1) a else b)
    .map { case ((cls1, cls2), (watts, (act, idx))) => (watts, (act, idx)) }
    .sortByKey(false)
    .take(20)
  // ...
}

Keeping EVERY WINDOW IN MEMORY… …even though recomputing windows is incredibly cheap and you’ll need only a tiny fraction of them?
Eliminating unnecessary memory pressure can lead to a substantial speedup!
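A minimal sketch of the fix (names are illustrative):

  // Before: cache every window, even those we’ll never revisit.
  val cachedWindows = windowsForActivities(data, period).cache

  // After: skip the cache when recomputation is cheap, or release the
  // memory explicitly once the cached data is no longer needed.
  val windows = windowsForActivities(data, period)
  cachedWindows.unpersist()   // drops cached partitions from executor memory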
SLIDE 86
Avoid shuffles when possible
SLIDE 89 [job DAG visualization; shuffle boundaries highlighted as the stages we want to avoid]
Avoid all unnecessary shuffles
SLIDE 90–97
def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP,
                   model: Broadcast[KMeansModel]) = {
  val windowedSamples = windowsForActivities(data, period, minify _)
  val bests = windowedSamples.map { case ((act, idx), samples) => (
    clusterPairsForWindow(samples, model.value),
    ((act, idx), samples.map(_.watts).reduce(_ + _) / samples.size)
  ) }.cache
  val top20 = bests.reduceByKey((a, b) => if (a._2 > b._2) a else b)
    .map { case ((_, _), keep) => keep }
    .takeOrdered(20)(Ordering.by[((String, Int), Double), Double] { case ((_, _), watts) => -watts })
  app.context.parallelize(top20)
    .join(windowedSamples)
    .map { case ((act, idx), (watts, samples)) => (watts, samples) }
    .collect
}

- the initial map keys each window directly by its start and end clusters, with window IDs and mean wattages as values
- reduceByKey followed by a simple map eliminates a join and a transpose
- takeOrdered replaces sortByKey(false).take(20) (use the right API calls!)
- joining top20 against windowedSamples eliminates a transpose… …or two, if you collect before the final projection:

app.context.parallelize(top20)
  .join(windowedSamples)
  .collect
  .map { case ((act, idx), (watts, samples)) => (watts, samples) }
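The same principle in its most common generic form (a sketch, not from the prototype): prefer combiner-style operations that pre-aggregate within each partition over operations that shuffle raw records.

  // Shuffles every raw (key, value) pair across the network before summing:
  val slow = pairs.groupByKey().mapValues(_.sum)

  // Sums within each partition first, shuffling only the partial sums:
  val fast = pairs.reduceByKey(_ + _)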
SLIDE 98
Embrace laziness
SLIDE 99 Embrace laziness
(only pay for what you use)
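A small sketch of laziness paying off (illustrative names; assuming a SparkContext named sc):

  val lines = sc.textFile("hdfs:///data/activities.txt")  // nothing is read yet
  val hits  = lines.filter(_.contains("watts"))           // still nothing
  // take(5) scans only as many partitions as it needs to find five matches:
  val sample = hits.take(5)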
SLIDE 100–102 Windowed processing redux
20140909.tcx is split into windows (20140909.tcx, 0), (20140909.tcx, 1), (20140909.tcx, 2), (20140909.tcx, 3), … (20140909.tcx, 14), each mapped to a start/end cluster pair (start: cluster 0, end: cluster 1; start: cluster 1, end: cluster 2; start: cluster 2, end: cluster 0)
SLIDE 103–107 Lazy windowed processing
20140909.tcx is split into window IDs ((20140909.tcx, 0) … (20140909.tcx, 14)), each summarized as a lightweight record (mean watts; begin and end offsets, e.g. 0:00–0:29; source file) and keyed by its start/end cluster pair (e.g. start: cluster 0, end: cluster 1); only the best summary per cluster pair is kept
SLIDE 108–109
trait ActivitySliding {
  import org.apache.spark.rdd.RDD
  import com.freevariable.surlaplaque.data.{Trackpoint => TP}
  // ...
  def applyWindowedNoZip[U: ClassTag](data: RDD[TP], period: Int, xform: ((String, Seq[TP]) => U)) = {
    val pairs = data.groupBy((tp: TP) => tp.activity.getOrElse("UNKNOWN"))
    pairs.flatMap { case (activity: String, stp: Seq[TP]) =>
      (stp sliding period).map { xform(activity, _) }
    }
  }
}

Perform arbitrary TRANSFORMATIONS on each window of samples
SLIDE 110
case class Effort(mmp: Double, activity: String, startTimestamp: Long, endTimestamp: Long) {}

Model EFFORT SUMMARIES as simple, lightweight records
SLIDE 111–114
def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP,
                   model: Broadcast[KMeansModel]) = {
  val clusteredMMPs = applyWindowedNoZip(data, period, {
    case (activity: String, samples: Seq[Trackpoint]) => (
      clusterPairsForWindow(samples, model.value),
      Effort(samples.map(_.watts).reduce(_ + _) / samples.size,
             activity, samples.head.timestamp, samples.last.timestamp)
    )
  })
  // continued
}
SLIDE 115–121
def bestsForPeriod(data: RDD[Trackpoint], period: Int, app: SLP,
                   model: Broadcast[KMeansModel]) = {
  val clusteredMMPs = /* map from cluster pairs to effort summary structures */
  clusteredMMPs
    .reduceByKey((a, b) => if (a.mmp > b.mmp) a else b)
    .takeOrdered(20)(Ordering.by[((Int, Int), Effort), Double] { case (_, e: Effort) => -e.mmp })
    .map { case (_, e: Effort) => (
      e.mmp,
      data.filter { case tp: Trackpoint =>
        tp.activity.getOrElse("UNKNOWN") == e.activity &&
        tp.timestamp <= e.endTimestamp && tp.timestamp >= e.startTimestamp
      }.collect
    ) }
}
SLIDE 122
Conclusions
SLIDE 123–127 Speedups from each improvement:
- Broadcast large static data: 1.1x
- Cache only when necessary: 2x
- Avoid shuffles when possible: 2x
- Embrace laziness (only pay for what you use): 14x
Work WITH Spark, not against it
SLIDE 128 Thanks!
willb@redhat.com http://chapeau.freevariable.com @willb