CS 744: Resilient Distributed Datasets
Shivaram Venkataraman Fall 2020
ADMINISTRIVIA
– Assignment 1: Due Sep 21, Monday at 10pm! Posted some submission notes to Piazza.
– Assignment 2 (ML) will be released Sep 22.

MOTIVATION: Programmability
Most real applications require multiple MR steps
– Google indexing pipeline: 21 steps
– Analytics queries (e.g. sessions, top K): 2-5 steps
– Iterative algorithms (e.g. PageRank): 10's of steps

Multi-step jobs create spaghetti code
– 21 MR steps → 21 mapper and reducer classes
MOTIVATION: Performance
MR only provides one pass of computation
– Must write out data to the file system in between steps

Expensive for apps that need to reuse data
– Multi-step algorithms (e.g. PageRank)
– Interactive data mining (latency matters)
Programmability

Google MapReduce WordCount: (slide shows the full Java code listing)
APACHE Spark: Programmability

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("out.txt")

Anonymous functions are written inline; far fewer lines of code than the MapReduce version.
APACHE Spark
Programmability: clean, functional API
– Parallel transformations on collections
– 5-10x less code than MR
– Available in Scala, Java, Python and R

Performance
– In-memory computing primitives
– Optimization across operators
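As a quick sketch of the in-memory primitives (the input path and parsing here are made up):

import org.apache.spark.storage.StorageLevel

val parsed = sc.textFile("hdfs://.../data")     // hypothetical input
  .map(_.split(","))
  .persist(StorageLevel.MEMORY_ONLY)            // keep the parsed RDD in memory

parsed.count()   // first action computes and caches the partitions
parsed.count()   // later actions reuse the cached copy instead of re-reading HDFS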
Spark Concepts
Resilient distributed datasets (RDDs)
– Immutable, partitioned collections of objects
– May be cached in memory for fast reuse

Operations on RDDs
– Transformations (build RDDs)
– Actions (compute results)

Restricted shared variables
– Broadcast variables, accumulators
Notes: once an RDD is created, its contents cannot be changed; Spark tracks how each RDD was derived using lineage records, which is what makes recovery possible.
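For the restricted shared variables, a minimal sketch (the severity table and log format are invented for illustration):

val severity = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))   // read-only, shipped once per worker
val unknown  = sc.longAccumulator("unknownLevels")                          // workers add to it, driver reads it

val levels = sc.textFile("hdfs://.../logs").map { line =>
  val level = line.split(" ")(0)
  if (!severity.value.contains(level)) unknown.add(1)
  severity.value.getOrElse(level, 0)
}
levels.count()
println(unknown.value)   // accumulator value is reliable only after an action has run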
Example: Log Mining
Find error messages present in log files interactively (Example: HTTP server logs)
lines = spark.textFile("hdfs://...")              // Base RDD
errors = lines.filter(_.startsWith("ERROR"))      // Transformed RDD
messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("foo")).count          // Action

(Figure: the driver ships tasks to workers holding Blocks 1-3; workers return results to the driver.)
messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count
. . .

(Figure: later queries send tasks that read from Cache 1-3 on the workers instead of re-reading HDFS.)

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: search 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))
HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD

Each RDD remembers the transformation that produced it (e.g. the filter function) and its parent, so a lost partition can be recomputed from the base RDD in HDFS by replaying those narrow transformations.
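One way to see this lineage (a sketch; the path is hypothetical) is RDD.toDebugString, which prints the chain of parent RDDs Spark would replay after a failure:

val messages = sc.textFile("hdfs://.../logs")
  .filter(_.startsWith("ERROR"))
  .map(_.split('\t')(2))

println(messages.toDebugString)   // e.g. MapPartitionsRDD <- MapPartitionsRDD <- HadoopRDD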
Other RDD Operations
Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, cogroup,
  flatMap, union, join, cross, mapValues, ...

Actions (output a result):
  collect, reduce, take, fold, count,
  saveAsTextFile, saveAsHadoopFile, ...
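A small made-up example combining a few of these operations:

val sales  = sc.parallelize(Seq(("cpu", 10), ("gpu", 25), ("cpu", 5)))
val prices = sc.parallelize(Seq(("cpu", 300), ("gpu", 900)))

val totals  = sales.reduceByKey(_ + _)                            // transformation: ("cpu", 15), ("gpu", 25)
val joined  = totals.join(prices)                                 // transformation: ("cpu", (15, 300)), ...
val revenue = joined.mapValues { case (qty, price) => qty * price }

revenue.collect().foreach(println)                                // action: results go to the driver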
DEPENDENCIES
(Hand-drawn figure: types of RDD dependencies, showing map operations, input files, and intermediate data.)
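A minimal sketch of the distinction, assuming the standard narrow vs. wide dependency terminology from the RDD paper (data is made up):

val pairs   = sc.parallelize(1 to 100).map(x => (x % 10, x))
val grouped = pairs.groupByKey()

println(pairs.dependencies)     // narrow: OneToOneDependency on the parent
println(grouped.dependencies)   // wide: ShuffleDependency, needs a shuffle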
Job Scheduler (1)
Captures the RDD dependency graph
Pipelines functions into "stages"
(Figure: a DAG of RDDs A-G built with groupBy, map, union, and join, split into Stages 1-3; pipelined operations like map and filter stay within a stage, and cached partitions are shaded.)
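As a rough sketch of how operators get pipelined into stages (word-count-style job; paths are hypothetical):

val counts = sc.textFile("hdfs://.../input")    // Stage 1: read,
  .flatMap(_.split(" "))                        //   pipelined with flatMap
  .map(word => (word, 1))                       //   and map
  .reduceByKey(_ + _)                           // shuffle boundary: Stage 2
counts.saveAsTextFile("hdfs://.../out")         // the action triggers both stages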
Job Scheduler (2)
Cache-aware for data reuse and locality
Partitioning-aware to avoid shuffles
(Figure: the same DAG as before; already-cached partitions and co-partitioned RDDs let the scheduler skip work and avoid shuffles.)
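A sketch of partitioning-awareness (the data layout and the 100-partition choice are assumptions): pre-partitioning one side of a join lets Spark shuffle only the other side.

import org.apache.spark.HashPartitioner

val users = sc.textFile("hdfs://.../users")                  // hypothetical (userId, profile) lines
  .map { line => val f = line.split(","); (f(0), f(1)) }
  .partitionBy(new HashPartitioner(100))
  .persist()

val events = sc.textFile("hdfs://.../events")                // hypothetical (userId, event) lines
  .map { line => val f = line.split(","); (f(0), f(1)) }

val joined = users.join(events)   // users is already hash-partitioned, so only events is shuffled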
CHECKPOINTING
sc.setCheckpointDir("hdfs://.../checkpoints")   // required before checkpoint(); path here is just an example
val rdd = sc.parallelize(1 to 100, 2).map(x => 2 * x)
rdd.checkpoint()

SUMMARY
Spark: generalizes the MR programming model
Supports in-memory computations with RDDs
Job Scheduler: pipelining, locality-aware
DISCUSSION
https://forms.gle/4JDXfpRuVaXmQHxD8
" '
""T
aq
.
paomptatim
faction
( like
, Count)Ds -
Driver ( H5t3t6)③I
= 15/
↳ bottleneck
⇐ ID
↳ ppartitions
rkeM .woyapoiffffmh
you tear we : d bytes .← reduction dothi?:p×dqtes=Driv#
inat
D-doi-hgredraBykeyq.dk
→ neg::#
Decor'I÷gamxID→Y:L
, #arena
tin
:%÷rdeW
"
peuilnkpm
" ppp
When would reduction trees be better than using `reduce` in Spark? When would they not be?
– Doing the work in stages adds scheduling and task-creation overhead, so if the data being transmitted is small, treeReduce might be slower than a plain reduce.
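A minimal sketch of the two approaches (made-up data; a depth of 2 is just an illustrative choice):

val nums = sc.parallelize(1 to 1000000, 100)    // 100 partitions

val total1 = nums.reduce(_ + _)                 // every partial sum goes straight to the driver
val total2 = nums.treeReduce(_ + _, 2)          // partial sums combined on executors first (tree of depth 2)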
dstat: a Python-based tool for monitoring CPU, network, and disk utilization on each node.
NEXT STEPS
Next week: Resource Management
Assignment 1 is due soon!
Review form: when is MR better than Spark, and vice versa?
– Spark is better when the job makes multiple passes over the data (it can stay in memory).
– Other factors: frequency of failures, and memory vs. disk speed.