Starting with Apache Spark, Best Practices and Learning from the Field - PowerPoint PPT Presentation
Starting with Apache Spark, Best Practices and Learning from the Field

Felix Cheung, Principal Engineer + Spark Committer Spark@Microsoft


Best Practices Enterprise Solutions


Resilient - Fault tolerant


19,500+ commits


Tungsten

AMPLab becoming RISELab
  • Drizzle – low latency execution, 3.5x lower latency than Spark Streaming
  • Ernest – performance prediction, automatically choose the optimal resource config on the cloud


Spark Core

Deployment, Scheduler, Resource Manager (aka Cluster Manager)
  • Spark History Server, Spark UI


Parallelization, Partition, Transformation, Action, Shuffle


Parallelization - doing multiple things at the same time


Partition - a unit of parallelization


Transformation - manipulating data; immutable. Can be "narrow" or "wide"
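The narrow/wide distinction can be sketched in plain Python (a toy model of partitions, not Spark's internals): a narrow transformation like map depends on exactly one input partition per output partition, while a wide one like groupByKey must move records between partitions, which is what forces a shuffle.

```python
# Toy model: each "partition" is just a list of records.
# All names here are illustrative, not Spark APIs.

def map_narrow(partitions, fn):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[fn(rec) for rec in part] for part in partitions]

def group_by_key_wide(partitions, num_out):
    """Wide: every output partition may need rows from every input partition,
    so records are re-bucketed by key hash (this movement is the 'shuffle')."""
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_out].append((key, value))
    return out

parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
doubled = map_narrow(parts, lambda kv: (kv[0], kv[1] * 2))  # no data movement
shuffled = group_by_key_wide(parts, num_out=2)              # data movement
assert doubled == [[("a", 2), ("b", 4)], [("a", 6), ("c", 8)]]
```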


Shuffle
  • Processing: sorting, serialize/deserialize, compression
  • Transfer: disk IO, network bandwidth/latency
  • Takes up memory, or spills to disk for intermediate results ("shuffle file")


Action - materialize results
  • Executes the chain of transformations that leads to output – lazy evaluation
  • Examples: count, collect -> take, write
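Lazy evaluation can be sketched in a few lines of plain Python (a toy illustration, not Spark's classes): transformations only record a plan on an immutable handle, and nothing runs until an action like collect() or count() walks the recorded chain.

```python
# Toy sketch of lazy evaluation (illustrative only, not Spark's implementation).
class LazyDataset:
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops                 # recorded transformations, not yet run

    # Transformations: record the step, return a new immutable handle.
    def map(self, fn):
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    # Actions: only here is the chain actually executed.
    def collect(self):
        rows = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def count(self):
        return len(self.collect())

ds = LazyDataset(range(5)).map(lambda x: x * x).filter(lambda x: x > 3)
# Nothing has run yet; count() triggers the whole chain.
assert ds.collect() == [4, 9, 16]
assert ds.count() == 3
```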


SQL

DataFrame, Dataset, Data source, Execution engine - Catalyst


Execution Plan, Predicate Pushdown


Strong typing, optimized execution


Dataset[Row]

Partition = set of Rows


"format" - Parquet, CSV, JSON, or Cassandra, HBase


Predicate Pushdown - the ability to process expressions as early in the plan as possible


spark.read.jdbc(jdbcUrl, "food", connectionProperties)

// with pushdown
spark.read.jdbc(jdbcUrl, "food", connectionProperties)
  .select("hotdog", "pizza", "sushi")


Streaming

Discretized Streams (DStreams), Receiver DStream, Direct DStream, Basic and Advanced Sources


Source Reliability, Receiver + Write Ahead Log (WAL), Checkpointing


https://databricks.com/wp-content/uploads/2015/01/blog-ha-52.png


Direct DStream
  • Only for reliable messaging sources that support reading from a position
  • Stronger fault-tolerance, exactly-once*
  • No receiver/WAL – less resource, lower overhead


Checkpointing - saving to reliable storage to recover from failure
  1. Metadata checkpointing: StreamingContext.checkpoint()
  2. Data checkpointing: dstream.checkpoint()


Machine Learning

ML Pipeline, Transformer, Estimator, Evaluator


DataFrame-based
  • leverage optimizations and support transformations

A sequence of algorithms
  • PipelineStages

[Diagram: a DataFrame flowing through Transformers and an Estimator - feature engineering, then modeling]


Feature transformer
  • take a DataFrame and its Column and append one or more new Columns


StopWordsRemover Binarizer SQLTransformer VectorAssembler


Estimators

An algorithm: DataFrame -> Model
A Model is a Transformer
Examples: LinearRegression, KMeans
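The Estimator/Transformer split above can be sketched in plain Python (hypothetical toy classes, not the pyspark.ml API): fit() learns parameters from data and returns a Model, and that Model is itself a Transformer.

```python
# Toy sketch of the Transformer/Estimator pattern (not pyspark.ml classes).
class MeanScalerModel:                 # a fitted Model is a Transformer
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):         # data -> data
        return [x - self.mean for x in rows]

class MeanScaler:                      # an Estimator: data -> Model
    def fit(self, rows):
        return MeanScalerModel(sum(rows) / len(rows))

model = MeanScaler().fit([1.0, 2.0, 3.0])    # estimation step learns the mean
centered = model.transform([1.0, 2.0, 3.0])  # transformation step applies it
assert centered == [-1.0, 0.0, 1.0]
```

In Spark the same shape appears as `LinearRegression().fit(df)` returning a `LinearRegressionModel` whose `transform(df)` appends a prediction column.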


Evaluator

Metric to measure Model performance on held-out test data


Evaluator

MulticlassClassificationEvaluator, BinaryClassificationEvaluator, RegressionEvaluator


MLWriter/MLReader

Pipeline persistence - includes transformers, estimators, Params


Graph

Graph, Pregel, Graph Algorithms, Graph Queries


Directed multigraph with user properties on edges and vertices

Example vertices: SEA, NYC, LAX


PageRank, ConnectedComponents

ranks = tripGraph.pageRank(resetProbability=0.15, maxIter=5)


DataFrame-based
  • simplify loading graph data, wrangling
  • support Graph Queries


Pattern matching - mix patterns with SQL syntax

motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a); !(c)-[]->(a)")
          .filter("a.id = 'MIA'")


Structured Streaming

Model, Source, Sink, StreamingQuery


Extends the same DataFrame API to incremental execution over unbounded input
Reliability, correctness / exactly-once - checkpointing (2.1: JSON format)


Stream as Unbounded Input

https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html


Watermark (2.1) - handling of late data
Streaming ETL, joining static data, partitioning, windowing
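The watermark idea can be sketched outside Spark (a toy illustration, not Spark's engine): track the maximum event time seen so far, and treat events older than that maximum minus the allowed delay as too late to update state.

```python
# Toy sketch of watermarking for late data (illustrative, not Spark's engine).
class Watermarker:
    def __init__(self, max_delay):
        self.max_delay = max_delay     # how much lateness to tolerate
        self.max_event_time = 0

    def accept(self, event_time):
        """Return True if the event is on time relative to the watermark."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.max_delay
        return event_time >= watermark

wm = Watermarker(max_delay=10)
assert wm.accept(100)      # advances the max event time to 100
assert wm.accept(95)       # late, but within the allowed delay
assert not wm.accept(80)   # too late: older than 100 - 10
```

In Structured Streaming the equivalent declaration is `df.withWatermark("time", "10 seconds")`, which bounds how long windowed state is kept.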


FileStreamSource, KafkaSource, MemoryStream (not for production), TextSocketSource, MQTT


FileStreamSink (new formats in 2.1), ConsoleSink, ForeachSink (Scala only), MemorySink – as Temp View


staticDF = (
  spark
    .read
    .schema(jsonSchema)
    .json(inputPath)
)


# Take a list of files as a stream
streamingDF = (
  spark
    .readStream
    .schema(jsonSchema)
    .option("maxFilesPerTrigger", 1)
    .json(inputPath)
)


streamingCountsDF = (
  streamingDF
    .groupBy(
      streamingDF.word,
      window(streamingDF.time, "1 hour"))
    .count()
)


query = (
  streamingCountsDF
    .writeStream
    .format("memory")
    .queryName("word_counts")
    .outputMode("complete")
    .start()
)

spark.sql("select count from word_counts order by time")


How much data goes in affects how much work it's going to take


Size does matter!
CSV or JSON is "simple" but also tends to be big
JSON -> Parquet (compressed)
  • 7x faster

Format also does matter

Recommended format - Parquet
  • default data source/format
  • VectorizedReader
  • better dictionary decoding

Parquet Columnar Format

Column chunks co-located
Metadata and headers for skipping


Recommend Parquet


Compression is a factor

gzip <100MB/s vs snappy 500MB/s
Tradeoffs: faster or smaller?
Spark 2.0+ defaults to snappy


Sidenote: Table Partitioning

Store data into groups by partitioning columns
Encoded path structure matches Hive:

table/event_date=2017-02-01
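In Spark this layout is produced by `df.write.partitionBy("event_date")`. The path encoding itself is simple enough to sketch with a hypothetical helper (illustrative only):

```python
# Hypothetical helper showing the Hive-style partition path encoding
# that Spark's .write.partitionBy(...) produces on disk.
def partition_path(table, **cols):
    parts = "/".join(f"{k}={v}" for k, v in cols.items())
    return f"{table}/{parts}"

assert partition_path("table", event_date="2017-02-01") == "table/event_date=2017-02-01"
# Multiple partition columns nest into subdirectories:
assert partition_path("logs", year=2017, month=2) == "logs/year=2017/month=2"
```

Because the column value is encoded in the path, a query filtering on `event_date` can skip whole directories (partition pruning) without reading their files.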


Spark UI Timeline view

https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html


Spark UI DAG view

https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html


Executor tab


SQL tab


Understanding Queries

explain() is your friend

but it could be hard to understand at times

== Parsed Logical Plan ==
Aggregate [count(1) AS count#79L]
+- Sort [speed_y#49 ASC], true
   +- Join Inner, (speed_x#48 = speed_y#49)
      :- Project [speed#2 AS speed_x#48, dist#3]
      :  +- LogicalRDD [speed#2, dist#3]
      +- Project [speed#18 AS speed_y#49, dist#19]
         +- LogicalRDD [speed#18, dist#19]


== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count#79L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#83L])
      +- *Project
         +- *Sort [speed_y#49 ASC], true, 0
            +- Exchange rangepartitioning(speed_y#49 ASC, 200)
               +- *Project [speed_y#49]
                  +- *SortMergeJoin [speed_x#48], [speed_y#49], Inner
                     :- *Sort [speed_x#48 ASC], false, 0
                     :  +- Exchange hashpartitioning(speed_x#48, 200)
                     :     +- *Project [speed#2 AS speed_x#48]
                     :        +- *Filter isnotnull(speed#2)
                     :           +- Scan ExistingRDD[speed#2,dist#3]
                     +- *Sort [speed_y#49 ASC], false, 0
                        +- Exchange hashpartitioning(speed_y#49, 200)
                           +- *Project [speed#18 AS speed_y#49]


UDF

Write your own custom transforms
But... Catalyst can't see through it (yet?!)
Always prefer builtin transforms as much as possible


UDF vs Builtin Example

Remember Predicate Pushdown?

val isSeattle = udf { (s: String) => s == "Seattle" }
cities.where(isSeattle('name))

*Filter UDF(name#2)
+- *FileScan parquet [id#128L,name#2] Batched: true, Format: ParquetFormat,
   InputPaths: file:/Users/b/cities.parquet, PartitionFilters: [],
   PushedFilters: [], ReadSchema: struct<id:bigint,name:string>


UDF vs Builtin Example

cities.where('name === "Seattle")

*Project [id#128L, name#2]
+- *Filter (isnotnull(name#2) && (name#2 = Seattle))
   +- *FileScan parquet [id#128L,name#2] Batched: true, Format: ParquetFormat,
      InputPaths: file:/Users/b/cities.parquet, PartitionFilters: [],
      PushedFilters: [IsNotNull(name), EqualTo(name,Seattle)],
      ReadSchema: struct<id:bigint,name:string>


UDF in Python

Avoid! Why? Pickling, transfer, extra memory to run Python interpreter

  • Hard to debug errors!

from pyspark.sql.types import IntegerType

sqlContext.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
sqlContext.sql("SELECT stringLengthInt('test')").take(1)


Going for Performance

  • stored in compressed Parquet
  • partitioned table
  • Predicate Pushdown
  • avoid UDFs


Shuffling for Join

Can be very expensive


Optimizing for Join

Partition! The join becomes a narrow transform if left and right are partitioned with the same scheme


Optimizing for Join

Broadcast Join (aka Map-Side Join in Hadoop)
  • join a smaller table against a large table - avoids shuffling the large table
  • default: 10MB auto broadcast
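The map-side join idea can be sketched in plain Python (a toy model, not Spark's BroadcastHashJoin): build a hash map from the small table once, ship that copy to every partition of the large table, and join locally, so the large table never moves.

```python
# Toy sketch of a broadcast (map-side) join: the small side becomes a
# hash map that every partition of the large side probes locally.
def broadcast_join(large_partitions, small_table):
    lookup = dict(small_table)                  # "broadcast" side, built once
    joined = []
    for part in large_partitions:               # large side stays in place
        joined.append([(k, v, lookup[k]) for k, v in part if k in lookup])
    return joined

large = [[(1, "x"), (2, "y")], [(3, "z"), (9, "w")]]   # two partitions
small = [(1, "one"), (2, "two"), (3, "three")]
result = broadcast_join(large, small)
assert result == [[(1, "x", "one"), (2, "y", "two")], [(3, "z", "three")]]
```

In Spark, tables under the auto-broadcast threshold are broadcast automatically, and `broadcast(df)` from `pyspark.sql.functions` can hint it explicitly.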


BroadcastHashJoin

left.join(right, Seq("id"), "leftanti").explain

== Physical Plan ==
*BroadcastHashJoin [id#50], [id#60], LeftAnti, BuildRight
:- LocalTableScan [id#50, left#51]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- LocalTableScan [id#60]


Repartition

repartition() - to numPartitions or by Columns
  • increases parallelism – will shuffle

coalesce() – combines partitions in place
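The difference can be sketched in plain Python (a toy model, not Spark internals): coalesce merges whole existing partitions without moving individual records between keys, while repartition re-buckets every record by hash, which is a shuffle.

```python
# Toy sketch: coalesce merges whole partitions in place,
# repartition re-distributes individual records by hash (a shuffle).
def coalesce(partitions, n):
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)        # whole partitions combined, no re-keying
    return out

def repartition(partitions, n):
    out = [[] for _ in range(n)]
    for part in partitions:
        for rec in part:               # every record may move: a shuffle
            out[hash(rec) % n].append(rec)
    return out

parts = [[1, 2], [3], [4, 5], [6]]
assert coalesce(parts, 2) == [[1, 2, 4, 5], [3, 6]]
assert sorted(sum(repartition(parts, 3), [])) == [1, 2, 3, 4, 5, 6]
```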


Cache

cache() or persist()
Flushes least-recently-used (LRU)
  • make sure there is enough memory!
MEMORY_AND_DISK to avoid expensive recompute (but spill to disk is slow)
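The LRU eviction mentioned above can be sketched with an OrderedDict (a toy illustration, not Spark's BlockManager): when capacity is exceeded, the entry touched longest ago is dropped first.

```python
from collections import OrderedDict

# Toy LRU cache sketch (not Spark's BlockManager): when the store is full,
# the least-recently-used entry is evicted first.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)         # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
cache.put("rdd1", "partition data 1")
cache.put("rdd2", "partition data 2")
cache.get("rdd1")                      # touch rdd1, so rdd2 becomes LRU
cache.put("rdd3", "partition data 3")  # over capacity: evicts rdd2
assert cache.get("rdd2") is None
assert cache.get("rdd1") == "partition data 1"
```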


Streaming

Use Structured Streaming (2.1+)
If not... if using reliable messaging (Kafka), use Direct DStream


Metadata - Config, position from streaming source (aka offset)
  • could get duplicates! (at-least-once)

Pending batches


Persist stateful transformations
  • data lost if not saved

Cuts short execution chains that could otherwise grow indefinitely


Direct DStream

Checkpoint also stores offsets
Turn off auto commit
  • commit when in a good state, for exactly-once

Checkpointing

Stream/ML/Graph/SQL
  • more efficient for indefinite/iterative recovery

Generally not versioning-safe
Use a reliable distributed file system (caution on "object store")


[Architecture diagram: WebLog analytics on Hadoop - HDFS, Hive, Hive Metastore, Spark SQL, External Data Source, BI Tools, FrontEnd; hourly batch]


[Architecture diagram: Kafka, Spark Streaming, Spark ML, HDFS, FrontEnd - near-real-time (end-to-end roundtrip: 8-20 sec), plus offline analysis]


[Architecture diagram: Spark SQL, RDBMS, BI Tools; Hive, SQL Appliance, BI Tools]

[Architecture diagram: Spark Streaming, Message Bus, Spark ML, Kafka, Storage, Spark SQL, Visualization, External Data Source, BI Tools, Data Lake, Hive Metastore, Spark ML]


[Architecture diagram: Spark Streaming, Kafka, Flume, HDFS, Spark SQL, Visualization, Presto, BI Tools, Hive, SQL, Data Science Notebook]


[Architecture diagram: Spark Streaming, Message Bus, Storage, SQL, Spark SQL, Data Factory]


Moving


https://www.linkedin.com/in/felix-cheung-b4067510 https://github.com/felixcheung