
SLIDE 1

SLIDE 2

Starting with Apache Spark, Best Practices and Learning from the Field

Felix Cheung, Principal Engineer + Spark Committer Spark@Microsoft

SLIDE 3

SLIDE 4

Best Practices Enterprise Solutions

SLIDE 5

SLIDE 6

SLIDE 7

Resilient - Fault tolerant

SLIDE 8

SLIDE 9

19,500+ commits

SLIDE 10

Tungsten AMPLab becoming RISELab

Drizzle – low latency execution, 3.5x lower than Spark Streaming

Ernest – performance prediction, automatically choose the optimal resource config on the cloud

SLIDE 11

SLIDE 12

Deployment Scheduler Resource Manager (aka Cluster Manager)

Spark History Server, Spark UI

Spark Core

SLIDE 13

SLIDE 14

Parallelization, Partition Transformation Action Shuffle

SLIDE 15

Doing multiple things at the same time

SLIDE 16

A unit of parallelization
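
To make "a unit of parallelization" concrete, here is a minimal plain-Python sketch of hash partitioning: records with the same key always land in the same partition, and each partition could then be processed independently. This is a toy model, not Spark's API; `hash_partition`, `trips` and the use of `zlib.crc32` as the hash are assumptions made for illustration.

```python
import zlib

def hash_partition(records, num_partitions, key_fn):
    """Assign each record to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # crc32 is stable across runs, unlike Python's built-in hash() on str
        key_bytes = str(key_fn(record)).encode("utf-8")
        partitions[zlib.crc32(key_bytes) % num_partitions].append(record)
    return partitions

# Hypothetical (airport, delay) records
trips = [("SEA", 3), ("NYC", 5), ("LAX", 2), ("SEA", 7), ("NYC", 1)]
parts = hash_partition(trips, 3, key_fn=lambda r: r[0])

for i, part in enumerate(parts):
    print(i, part)
```

Every record ends up in exactly one partition, and both "SEA" records share a partition, which is what lets per-key work proceed without moving data again.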

SLIDE 17

Manipulating data - immutable "Narrow" "Wide"

SLIDE 18

SLIDE 19

SLIDE 20

Processing: sorting, serialize/deserialize, compression
Transfer: disk IO, network bandwidth/latency
Take up memory, or spill to disk for intermediate results ("shuffle file")

SLIDE 21

Materialize results
Execute the chain of transformations that leads to output – lazy evaluation
count
collect -> take
write
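
The lazy-evaluation idea can be sketched in plain Python: transformations only record what to do, and an action such as `take` or `count` runs the chain. The `LazyDataset` class below is a hypothetical toy, not Spark code; it also hints at why `take` is often preferable to `collect`, since it stops as soon as it has enough rows.

```python
import itertools

class LazyDataset:
    """Toy dataset: map/filter are recorded, actions execute them."""
    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def take(self, n):      # action: stops once n results are produced
        return list(itertools.islice(self._run(), n))

    def count(self):        # action: runs the whole chain
        return sum(1 for _ in self._run())

evaluated = []              # records which inputs were actually processed

def double(x):
    evaluated.append(x)
    return x * 2

ds = LazyDataset(range(1, 6)).map(double).filter(lambda x: x > 4)
before = list(evaluated)        # nothing has run yet
first = ds.take(1)
after_take = list(evaluated)    # only the inputs needed for one result
total = ds.count()
```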

SLIDE 22

DataFrame Dataset Data source Execution engine - Catalyst

SQL

SLIDE 23

Execution Plan Predicate Pushdown

SLIDE 24

Strong typing Optimized execution

SLIDE 25

Dataset[Row]

Partition = set of Row's

SLIDE 26

"format" - Parquet, CSV , JSON, or Cassandra, HBase

SLIDE 27

SLIDE 28

Ability to process expressions as early in the plan as possible
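
As a rough illustration of processing expressions early, the plain-Python sketch below compares filtering after a full read with handing the filter and the column list to the source. `ROWS` and `scan` are hypothetical stand-ins for a data source; real pushdown is decided by Catalyst and the data source implementation.

```python
# Hypothetical table contents
ROWS = [
    {"id": 1, "name": "Seattle", "state": "WA"},
    {"id": 2, "name": "Portland", "state": "OR"},
    {"id": 3, "name": "Seattle", "state": "WA"},
]

def scan(rows, columns=None, predicate=None):
    """Toy source that can apply a filter and prune columns while reading."""
    out = []
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # skipped at the source, never handed to the engine
        out.append({c: row[c] for c in (columns or row)})
    return out

# Without pushdown: every row and column is read, then filtered afterwards.
all_rows = scan(ROWS)
late = [{"id": r["id"]} for r in all_rows if r["name"] == "Seattle"]

# With pushdown: the source returns only what the query needs.
early = scan(ROWS, columns=["id"], predicate=lambda r: r["name"] == "Seattle")
```

Both paths give the same answer; the difference is how much data leaves the source.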

SLIDE 29

spark.read.jdbc(jdbcUrl, "food", connectionProperties)

// with pushdown
spark.read.jdbc(jdbcUrl, "food", connectionProperties).select("hotdog", "pizza", "sushi")

SLIDE 30

Discretized Streams (DStreams) Receiver DStream Direct DStream Basic and Advanced Sources

Streaming

SLIDE 31

Source Reliability Receiver + Write Ahead Log (WAL) Checkpointing

SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

https://databricks.com/wp-content/uploads/2015/01/blog-ha-52.png

SLIDE 36

Only for reliable messaging sources that support reading from a position
Stronger fault-tolerance, exactly-once*
No receiver/WAL – less resource, lower overhead

SLIDE 37

Saving to reliable storage to recover from failure

1. Metadata checkpointing

StreamingContext.checkpoint()

2. Data checkpointing

dstream.checkpoint()

SLIDE 38

ML Pipeline Transformer Estimator Evaluator

Machine Learning

SLIDE 39

DataFrame-based

leverage optimizations and support transformations

a sequence of algorithms - PipelineStages

SLIDE 40

Transformer Estimator Transformer DataFrame

Feature engineering Modeling

SLIDE 41

Feature transformer

take a DataFrame and its Column and append one or more new Columns

SLIDE 42

StopWordsRemover Binarizer SQLTransformer VectorAssembler

SLIDE 43

Estimators

An algorithm
DataFrame -> Model
A Model is a Transformer
LinearRegression
KMeans
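
The Transformer / Estimator / Model relationship can be sketched without Spark. In the toy below, rows are plain dicts, a transformer appends a column, and an estimator's `fit` returns a model that is itself a transformer. All class and function names (`ToyBinarizer`, `ToyMeanCenterer`, `fit_pipeline`, `transform_pipeline`) are hypothetical and only mirror the shape of the ML Pipeline API.

```python
class ToyBinarizer:
    """Transformer: appends a new column computed from an existing one."""
    def __init__(self, input_col, output_col, threshold):
        self.input_col, self.output_col = input_col, output_col
        self.threshold = threshold

    def transform(self, rows):
        return [{**r, self.output_col: 1.0 if r[self.input_col] > self.threshold else 0.0}
                for r in rows]

class ToyMeanCenterer:
    """Estimator: fit() learns from the data and returns a model."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return ToyMeanCentererModel(self.input_col, self.output_col, mean)

class ToyMeanCentererModel:
    """The fitted model is itself a transformer."""
    def __init__(self, input_col, output_col, mean):
        self.input_col, self.output_col, self.mean = input_col, output_col, mean

    def transform(self, rows):
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]

def fit_pipeline(stages, rows):
    """Fit estimators in order, feeding each stage the previous stage's output."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted

def transform_pipeline(fitted, rows):
    for model in fitted:
        rows = model.transform(rows)
    return rows

train = [{"x": 1.0}, {"x": 3.0}]
stages = [ToyBinarizer("x", "big", 2.0), ToyMeanCenterer("x", "x_centered")]
fitted = fit_pipeline(stages, train)
scored = transform_pipeline(fitted, [{"x": 4.0}])
```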

SLIDE 44

Evaluator

Metric to measure Model performance on held-out test data

SLIDE 45

Evaluator

MulticlassClassificationEvaluator BinaryClassificationEvaluator RegressionEvaluator

SLIDE 46

MLWriter/MLReader

Pipeline persistence Include transformers, estimators, Params

SLIDE 47

Graph Pregel Graph Algorithms Graph Queries

Graph

SLIDE 48

Directed multigraph with user properties on edges and vertices

SEA NYC LAX

SLIDE 49

PageRank ConnectedComponents

ranks = tripGraph.pageRank(resetProbability=0.15, maxIter=5)

SLIDE 50

DataFrame-based Simplify loading graph data, wrangling Support Graph Queries

SLIDE 51

Pattern matching Mix pattern with SQL syntax

motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a); !(c)-[]->(a)").filter("a.id = 'MIA'")

SLIDE 52

Structured Streaming Model Source Sink StreamingQuery

Structured Streaming

SLIDE 53

Extending same DataFrame to include incremental execution of unbounded input Reliability, correctness / exactly-once - checkpointing (2.1 JSON format)

SLIDE 54

Stream as Unbounded Input

https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

SLIDE 55

Watermark (2.1) - handling of late data Streaming ETL, joining static data, partitioning, windowing
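
A watermark can be thought of as "the latest event time seen so far, minus an allowed delay"; data older than that is treated as too late. The plain-Python sketch below is a simplification under stated assumptions: it updates the watermark per event, whereas Structured Streaming advances it between triggers, and `apply_watermark` and the sample events are hypothetical.

```python
def apply_watermark(events, delay):
    """Split (event_time, value) pairs into accepted and too-late events."""
    max_seen = None
    accepted, dropped = [], []
    for event_time, value in events:
        if max_seen is None or event_time > max_seen:
            max_seen = event_time
        watermark = max_seen - delay
        if event_time >= watermark:
            accepted.append((event_time, value))
        else:
            dropped.append((event_time, value))  # arrived after the watermark passed it
    return accepted, dropped

# Event times are plain integers (for example seconds); arrival order is list order.
events = [(10, "a"), (12, "b"), (7, "c"), (20, "d"), (11, "e")]
accepted, dropped = apply_watermark(events, delay=5)
```

Here the event at time 7 is late but still within the allowed delay, while the event at time 11 arrives after time 20 has been seen and is dropped.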

SLIDE 56

FileStreamSource KafkaSource MemoryStream (not for production) TextSocketSource MQTT

SLIDE 57

FileStreamSink (new formats in 2.1) ConsoleSink ForeachSink (Scala only) MemorySink – as Temp View

SLIDE 58

staticDF = (
    spark
    .read
    .schema(jsonSchema)
    .json(inputPath)
)

SLIDE 59

# Take a list of files as a stream
streamingDF = (
    spark
    .readStream
    .schema(jsonSchema)
    .option("maxFilesPerTrigger", 1)
    .json(inputPath)
)

SLIDE 60

from pyspark.sql.functions import window

streamingCountsDF = (
    streamingDF
    .groupBy(
        streamingDF.word,
        window(streamingDF.time, "1 hour"))
    .count()
)
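
For intuition about what the windowed aggregation computes, the plain-Python sketch below counts words per one-hour tumbling window by truncating each timestamp to the top of its hour. It is a batch toy, not a streaming engine; `hourly_word_counts` and the sample events are assumptions.

```python
from collections import Counter
from datetime import datetime

def hourly_word_counts(events):
    """Count (word, window_start) pairs using one-hour tumbling windows."""
    counts = Counter()
    for word, ts in events:
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        counts[(word, window_start)] += 1
    return counts

events = [
    ("spark", datetime(2017, 2, 1, 9, 5)),
    ("spark", datetime(2017, 2, 1, 9, 55)),
    ("spark", datetime(2017, 2, 1, 10, 1)),
    ("kafka", datetime(2017, 2, 1, 9, 30)),
]
counts = hourly_word_counts(events)
```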

SLIDE 61

query = (
    streamingCountsDF
    .writeStream
    .format("memory")
    .queryName("word_counts")
    .outputMode("complete")
    .start()
)

# The in-memory table has the columns word, window and count
spark.sql("select word, window, count from word_counts order by window")

SLIDE 62

SLIDE 63

How much going in affects how much work it's going to take

SLIDE 64

Size does matter!
CSV or JSON is "simple" but also tends to be big
JSON -> Parquet (compressed)

7x faster

SLIDE 65

Format also does matter

Recommended format - Parquet Default data source/format

VectorizedReader
Better dictionary decoding

SLIDE 66

Parquet Columnar Format

Column chunk co-located Metadata and headers for skipping

SLIDE 67

Recommend Parquet

SLIDE 68

Compression is a factor

gzip <100MB/s vs snappy 500MB/s Tradeoffs: faster or smaller? Spark 2.0+ defaults to snappy
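
The faster-or-smaller tradeoff can be tried locally with the standard library. Snappy is not in Python's stdlib, so this sketch uses two `zlib` levels as a stand-in: level 1 favors speed, level 9 favors size. The sample records are made up, and the timings and sizes you see will depend on your data and machine.

```python
import json
import time
import zlib

# Hypothetical, highly repetitive JSON payload
records = [{"id": i, "city": "Seattle", "event": "click"} for i in range(2000)]
raw = json.dumps(records).encode("utf-8")

def timed_compress(data, level):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    return compressed, time.perf_counter() - start

fast, fast_secs = timed_compress(raw, 1)    # speed-oriented setting
small, small_secs = timed_compress(raw, 9)  # size-oriented setting

print("raw bytes:", len(raw))
print("level 1:", len(fast), "bytes in", fast_secs, "s")
print("level 9:", len(small), "bytes in", small_secs, "s")
```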

SLIDE 69

Sidenote: Table Partitioning

Store data in groups by partitioning columns
Encoded path structure matches Hive

table/event_date=2017-02-01
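
The Hive-style layout encodes partition values as `column=value` directory names, which lets a reader skip whole directories. The plain-Python sketch below builds and parses such paths and prunes by partition value; the helper names and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath

def partition_path(table, **partition_values):
    """Build a Hive-style path such as table/event_date=2017-02-01."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return str(PurePosixPath(table, *parts))

def parse_partition_path(path):
    """Recover the partition column values encoded in a path."""
    values = {}
    for segment in PurePosixPath(path).parts:
        if "=" in segment:
            col, val = segment.split("=", 1)
            values[col] = val
    return values

def prune(paths, column, wanted):
    """Keep only the directories whose partition value matches."""
    return [p for p in paths if parse_partition_path(p).get(column) == wanted]

paths = [partition_path("table", event_date=d)
         for d in ("2017-02-01", "2017-02-02", "2017-02-03")]
selected = prune(paths, "event_date", "2017-02-02")
```

A filter on `event_date` only needs to open the matching directory; the other two are never read.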

SLIDE 70

Spark UI Timeline view

https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

SLIDE 71

SLIDE 72

Spark UI DAG view

https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

SLIDE 73

Executor tab

SLIDE 74

SQL tab

SLIDE 75

SLIDE 76

Understanding Queries

explain() is your friend

but it could be hard to understand at times

== Parsed Logical Plan ==
Aggregate [count(1) AS count#79L]
+- Sort [speed_y#49 ASC], true
   +- Join Inner, (speed_x#48 = speed_y#49)
      :- Project [speed#2 AS speed_x#48, dist#3]
      :  +- LogicalRDD [speed#2, dist#3]
      +- Project [speed#18 AS speed_y#49, dist#19]
         +- LogicalRDD [speed#18, dist#19]

SLIDE 77

SLIDE 78

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#79L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#83L])
      +- *Project
         +- *Sort [speed_y#49 ASC], true, 0
            +- Exchange rangepartitioning(speed_y#49 ASC, 200)
               +- *Project [speed_y#49]
                  +- *SortMergeJoin [speed_x#48], [speed_y#49], Inner
                     :- *Sort [speed_x#48 ASC], false, 0
                     :  +- Exchange hashpartitioning(speed_x#48, 200)
                     :     +- *Project [speed#2 AS speed_x#48]
                     :        +- *Filter isnotnull(speed#2)
                     :           +- Scan ExistingRDD[speed#2,dist#3]
                     +- *Sort [speed_y#49 ASC], false, 0
                        +- Exchange hashpartitioning(speed_y#49, 200)
                           +- *Project [speed#18 AS speed_y#49]

SLIDE 79

UDF

Write your own custom transforms
But... Catalyst can't see through it (yet?!)
Always prefer to use builtin transforms as much as possible

SLIDE 80

UDF vs Builtin Example

Remember Predicate Pushdown?

val isSeattle = udf { (s: String) => s == "Seattle" }
cities.where(isSeattle('name))

*Filter UDF(name#2)
+- *FileScan parquet [id#128L,name#2] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/b/cities.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint,name:string>

SLIDE 81

UDF vs Builtin Example

cities.where('name === "Seattle")

*Project [id#128L, name#2]
+- *Filter (isnotnull(name#2) && (name#2 = Seattle))
   +- *FileScan parquet [id#128L,name#2] Batched: true, Format: ParquetFormat, InputPaths: file:/Users/b/cities.parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Seattle)], ReadSchema: struct<id:bigint,name:string>

SLIDE 82

UDF in Python

Avoid! Why? Pickling, transfer, extra memory to run Python interpreter

Hard to debug errors!

from pyspark.sql.types import IntegerType

sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
sqlContext.sql("SELECT stringLengthInt('test')").take(1)

SLIDE 83

Going for Performance

Stored in compressed Parquet Partitioned table Predicate Pushdown Avoid UDF

SLIDE 84

Shuffling for Join

Can be very expensive

SLIDE 85

Optimizing for Join

Partition! Narrow transform if left and right partitioned with same scheme

SLIDE 86

Optimizing for Join

Broadcast Join (aka Map-Side Join in Hadoop) Smaller table against large table - avoid shuffling large table Default 10MB auto broadcast
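
The broadcast (map-side) join idea can be sketched in plain Python: build a hash table from the small side once, then let every partition of the large side probe it locally, so the large side is never moved. `broadcast_hash_join` and the sample tables are hypothetical; the sketch performs an inner join.

```python
def broadcast_hash_join(large_partitions, small_table, key):
    """Inner-join each partition of the large side against a shared lookup table."""
    # Build the hash table once from the small side; this is what gets "broadcast".
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in large_partitions:      # each partition is processed where it lives
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**row, **match})
        joined.append(out)
    return joined

# Large side, already split into two partitions
trips_partitions = [
    [{"airport": "SEA", "delay": 5}, {"airport": "NYC", "delay": 12}],
    [{"airport": "LAX", "delay": 0}, {"airport": "XXX", "delay": 3}],
]
# Small side, cheap to copy to every worker
airports = [
    {"airport": "SEA", "city": "Seattle"},
    {"airport": "NYC", "city": "New York"},
    {"airport": "LAX", "city": "Los Angeles"},
]
result = broadcast_hash_join(trips_partitions, airports, "airport")
```

The output keeps the partitioning of the large side, which is the point: no shuffle of the big table was needed.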

SLIDE 87

BroadcastHashJoin

left.join(right, Seq("id"), "leftanti").explain

== Physical Plan ==
*BroadcastHashJoin [id#50], [id#60], LeftAnti, BuildRight
:- LocalTableScan [id#50, left#51]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- LocalTableScan [id#60]

SLIDE 88

Repartition

To numPartitions or by Columns

Increase parallelism – will shuffle

coalesce() – combine partitions in place
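
The difference between the two operations can be modeled in plain Python: coalescing merges whole existing partitions, while repartitioning redistributes every record. The functions below are a toy model with hypothetical names; the round-robin assignment stands in for a shuffle.

```python
def coalesce(partitions, n):
    """Merge existing partitions into n groups; whole partitions are combined."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def repartition(partitions, n):
    """Redistribute every record round-robin, which models a full shuffle."""
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % n].append(record)
            i += 1
    return out

parts = [[1, 2, 3], [4], [5, 6], [7]]
combined = coalesce(parts, 2)     # cheap, but sizes may stay uneven
balanced = repartition(parts, 2)  # even sizes, but every record moves
```

Coalescing keeps the skew of the input (five records against two), while repartitioning evens it out at the cost of moving everything.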

SLIDE 89

Cache cache() or persist()

Flush least-recently-used (LRU)

Make sure there is enough memory!

MEMORY_AND_DISK to avoid expensive recompute (but spill to disk is slow)
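
Why "enough memory" matters can be shown with a small least-recently-used cache in plain Python: once capacity is exceeded, the coldest block is dropped and must be recomputed on its next use. `LruBlockCache` is a hypothetical toy and does not model spilling to disk.

```python
from collections import OrderedDict

class LruBlockCache:
    """Toy cache that evicts the least-recently-used block when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.recomputes = 0

    def get(self, block_id, compute):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as most recently used
            return self.blocks[block_id]
        self.recomputes += 1                    # cache miss: pay the compute cost
        value = compute(block_id)
        self.blocks[block_id] = value
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used
        return value

cache = LruBlockCache(capacity=2)
for block_id in ["a", "b", "a", "c", "b"]:
    cache.get(block_id, compute=lambda b: b.upper())

remaining = list(cache.blocks)
```

With room for only two blocks, five accesses trigger four computations; a capacity of three would have needed three.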

SLIDE 90

Streaming

Use Structured Streaming (2.1+) If not... If reliable messaging (Kafka) use Direct DStream

SLIDE 91

Metadata - Config
Position from streaming source (aka offset)
could get duplicates! (at-least-once)

Pending batches

SLIDE 92

Persist stateful transformations

data lost if not saved

Cut short execution that could grow indefinitely

SLIDE 93

Direct DStream

Checkpoint also stores offset
Turn off auto commit

do when in good state for exactly-once
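
The gap between at-least-once and exactly-once can be sketched in plain Python. If a batch is written but the job fails before the offset is saved, the batch is replayed; an append-only sink then sees duplicates, while a sink keyed by offset simply overwrites itself. `run_batch`, the in-memory "checkpoint" dict and both sinks are hypothetical stand-ins.

```python
def run_batch(source, checkpoint, write, crash_before_commit=False):
    """Process everything after the saved offset, then save the new offset."""
    start = checkpoint["offset"] + 1
    for offset in range(start, len(source)):
        write(offset, source[offset])
    if crash_before_commit:
        return  # simulated failure: output was written, offset was not saved
    checkpoint["offset"] = len(source) - 1

source = ["a", "b", "c"]

# Append-only sink: the replay after the crash produces duplicates (at-least-once).
log, cp = [], {"offset": -1}
run_batch(source, cp, lambda o, m: log.append(m), crash_before_commit=True)
run_batch(source, cp, lambda o, m: log.append(m))

# Sink keyed by offset: replayed writes overwrite the same entries.
table, cp2 = {}, {"offset": -1}
run_batch(source, cp2, lambda o, m: table.__setitem__(o, m), crash_before_commit=True)
run_batch(source, cp2, lambda o, m: table.__setitem__(o, m))
```

The design choice this illustrates: committing the offset only after the output is safely written gives at-least-once delivery, and an idempotent write is one way to turn that into an exactly-once result.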

SLIDE 94

Checkpointing

Stream/ML/Graph/SQL

more efficient indefinite/iterative
recovery

Generally not versioning-safe Use reliable distributed file system

(caution on “object store”)

SLIDE 95

SLIDE 96

SLIDE 97

Hadoop WebLog Spark SQL External Data Source BI Tools

HDFS Hive Hive Metastore FrontEnd Hourly

SLIDE 98

Spark Streaming Spark ML Kafka HDFS FrontEnd Near-RealTime (end-to-end roundtrip: 8-20 sec) Offline Analysis

SLIDE 99

SLIDE 100

Spark SQL RDBMS BI Tools

Hive SQL Appliance BI Tools

SLIDE 101

SLIDE 102

Spark Streaming Message Bus Spark ML Kafka Storage Spark SQL

Visualization

External Data Source BI Tools

Data Lake Hive Metastore Spark ML

SLIDE 103

SLIDE 104

Spark Streaming Kafka HDFS Spark SQL

Visualization

Presto BI Tools

Hive SQL Flume SQL Data Science Notebook

SLIDE 105

SLIDE 106

Spark Streaming Message Bus Storage SQL Spark SQL Spark SQL Data Factory

SLIDE 107

Moving

SLIDE 108

SLIDE 109