Spark: Marco Serafini, COMPSCI 532 Lecture 4

SLIDE 1

Spark

Marco Serafini

COMPSCI 532 Lecture 4

SLIDE 2

SLIDE 3

Goals

  • Support for iterative jobs
      • Reuse of intermediate results without writing to disk
  • Lineage-based fault tolerance
      • Does not require checkpointing all intermediate results
SLIDE 4

Resilient Distributed Datasets (RDDs)

  • Collection of records (serialized objects)
  • Read-only
  • Created through deterministic transformations from
      • Data in storage
      • Other RDDs
  • Lineage of an RDD
      • Sequence of transformations that created it
      • Replayed (in parallel) from persisted data to recreate a lost RDD
  • Caching an RDD: keeping it in memory for later reuse
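
The lineage and caching ideas above can be sketched with a toy model in plain Python. This is not Spark's API: ToyRDD and all of its methods are hypothetical names made up for illustration, and the replay here is sequential rather than parallel.

```python
class ToyRDD:
    """Read-only collection defined by its parent and a deterministic function."""

    def __init__(self, source=None, parent=None, fn=None, name="source"):
        self._source = source    # data in "storage" (root RDDs only)
        self._parent = parent    # the RDD this one was derived from
        self._fn = fn            # deterministic transformation over a list of rows
        self._cached = False     # whether the user asked to keep it in memory
        self._cache = None       # the in-memory copy, if any
        self.name = name

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)],
                      name="filter")

    def cache(self):
        self._cached = True
        return self

    def lineage(self):
        """Names of the transformations leading to this RDD, root first."""
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self.name]

    def compute(self):
        """Replay the lineage, stopping early at any cached copy."""
        if self._cache is not None:
            return list(self._cache)
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._fn(self._parent.compute())
        if self._cached:
            self._cache = list(rows)
        return rows

    def lose(self):
        """Simulate losing the in-memory copy, e.g. after a worker failure."""
        self._cache = None


numbers = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

first = evens_squared.compute()      # computed from storage, then kept in memory
evens_squared.lose()                 # the cached copy disappears
recovered = evens_squared.compute()  # rebuilt by replaying the lineage
```

Because every transformation is deterministic and the source is read-only, replaying the lineage yields the same records as the lost copy.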
SLIDE 5

Spark terminology

  • Driver
      • Process executing the application code
      • Sends RDD transformations and actions to workers
  • Workers
      • Host partitions of RDDs
      • Execute transformations
SLIDE 6

More Terminology

  • Task
      • Unit of work that will be sent to one executor
      • Applies a fundamental operator, such as Map or Reduce, to one partition
  • Stage
      • Set of parallel tasks, one task per partition
      • Can pipeline multiple operators, as long as no shuffle is needed
  • Job
      • Parallel computation consisting of multiple stages
      • Spawned in response to an action
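
The relation between job, stages and tasks can be sketched as follows. This is a simplified plain-Python model, not Spark's scheduler: split_into_stages, count_tasks and the WIDE set are hypothetical, and the model assumes a linear chain of operators in which the listed wide operators always need a shuffle.

```python
# Operators assumed (for this sketch) to require a shuffle.
WIDE = {"groupByKey", "reduceByKey", "join"}


def split_into_stages(operators):
    """Group consecutive operators; a wide operator starts a new stage."""
    stages = [[]]
    for op in operators:
        if op in WIDE and stages[-1]:
            stages.append([])        # the shuffle ends the previous stage
        stages[-1].append(op)
    return stages


def count_tasks(stages, num_partitions):
    """One task per partition in every stage."""
    return len(stages) * num_partitions


# One job: the chain of operators evaluated when an action is called.
job = ["textFile", "map", "filter", "reduceByKey", "map"]
stages = split_into_stages(job)
total_tasks = count_tasks(stages, num_partitions=4)
```

Here the three operators before the shuffle are pipelined inside one stage, so each of its four tasks runs all three on its own partition.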
SLIDE 7

More terminology

  • Shuffle
      • Data transfer among workers
  • Partition
      • Chunk of an RDD, processed by one worker thread
  • Transformation
      • Function that produces a new RDD but no output
      • Lazily evaluated
  • Action
      • Function that returns output
      • Triggers evaluation
SLIDE 8

Spark API

SLIDE 9

Spark Computation

  • Driver executes application code
      • Workers execute only transformations
      • Driver sends closure to workers
  • Lazy evaluation
      • Driver records transformations without executing them
      • It builds a Directed Acyclic Graph (DAG) of transformations
      • Transformations execute only as needed, when an output (action) is required
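
Lazy evaluation can be illustrated with a small plain-Python sketch. LazyDataset is a hypothetical class, not part of Spark, and its recorded plan is a linear chain, which is the simplest special case of a DAG.

```python
class LazyDataset:
    """Hypothetical driver-side handle: records transformations, runs nothing."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = plan             # recorded (kind, function) pairs

    def map(self, f):
        return LazyDataset(self._rows, self.plan + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._rows, self.plan + (("filter", pred),))

    def collect(self):
        """Action: only now is the recorded plan executed."""
        rows = list(self._rows)
        for kind, f in self.plan:
            if kind == "map":
                rows = [f(r) for r in rows]
            else:
                rows = [r for r in rows if f(r)]
        return rows


calls = []                           # records every invocation of traced_double


def traced_double(x):
    calls.append(x)
    return 2 * x


ds = LazyDataset([1, 2, 3]).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)     # the transformations have not run yet
result = ds.collect()                # the action triggers evaluation
```

Building ds only grows the plan; traced_double is first called inside collect().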

SLIDE 10

Closures

  • Driver serializes functions to be executed by workers
  • It computes a “closure” and sends it to the workers
  • Closure includes all objects referenced by the function
  • Be careful with references, or you might send huge closures
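
A plain-Python sketch of why references matter. It is only an analogy: closure_size is a hypothetical helper, and pickle stands in for whatever serializer Spark actually uses. Two functions compute the same thing, but one captures a large table and the other captures a single value copied out of it.

```python
import pickle


def closure_size(fn):
    """Approximate number of bytes needed to ship the objects captured by fn."""
    cells = fn.__closure__ or ()
    return sum(len(pickle.dumps(cell.cell_contents)) for cell in cells)


def make_functions():
    lookup = {i: i * i for i in range(100_000)}   # large driver-side object
    threshold = lookup[10]                        # copy out the one value needed

    def uses_whole_table(x):
        return x > lookup[10]        # captures the entire dictionary

    def uses_one_value(x):
        return x > threshold         # captures a single integer

    return uses_whole_table, uses_one_value


big, small = make_functions()
big_bytes = closure_size(big)
small_bytes = closure_size(small)
```

Copying the needed value into a local variable before defining the function keeps the shipped closure small while leaving the result unchanged.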
SLIDE 11

Narrow vs. Wide Operators

  • Narrow dependency
      • Executes on data local to the same worker
      • Can be pipelined locally
      • Faster recovery (only local re-execution needed)
      • Example: map → filter
  • Wide dependency
      • Requires a shuffle, which marks the end of a stage
      • Complex recovery (multi-worker)
      • Example: map → groupByKey (if the data is not already partitioned by that key)
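
The difference can be made concrete with two in-memory "workers" in plain Python. The function names and the byte-sum hash are assumptions chosen for this sketch (Python's built-in hash of strings is randomized per process, so it is avoided here); none of this is Spark code.

```python
from collections import defaultdict

# Two "workers", each holding one partition of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5), ("a", 6)],
]


def narrow_map_filter(parts):
    """map then filter, pipelined inside each partition."""
    out = []
    for part in parts:                                     # no record leaves its partition
        mapped = ((k, v * 10) for k, v in part)            # map
        out.append([(k, v) for k, v in mapped if v > 20])  # filter
    return out


def shuffle_group_by_key(parts, num_partitions):
    """groupByKey: each record is routed to the partition that owns its key."""
    groups = [defaultdict(list) for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(parts):
        for k, v in part:
            dst = sum(k.encode()) % num_partitions         # deterministic toy hash
            groups[dst][k].append(v)
            moved += dst != src                            # record crossed workers
    return [dict(g) for g in groups], moved


local_result = narrow_map_filter(partitions)
grouped, records_moved = shuffle_group_by_key(partitions, num_partitions=2)
```

If the input were already partitioned with the same hash, every record would stay where it is and records_moved would be zero, which is the case the groupByKey example excludes.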
SLIDE 12

Checkpoints and Partitioning

  • Partitioning function is declared when the RDD is created
  • Checkpointing
      • Speeds up recovery by shortening the lineage that must be replayed
      • When to checkpoint is left to the user
      • Checkpoint is stored (replicated) on the file system
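
How a checkpoint shortens recovery can be sketched in plain Python. The helpers run_iterations and recover, the ckpt-N.json file names and the "add one" update are all hypothetical; a temporary local directory stands in for a replicated file system.

```python
import json
import os
import tempfile


def run_iterations(start, steps, checkpoint_every, directory):
    """Apply `steps` deterministic updates, saving the state every few steps."""
    state = list(start)
    for i in range(1, steps + 1):
        state = [x + 1 for x in state]             # one transformation in the lineage
        if i % checkpoint_every == 0:
            with open(os.path.join(directory, f"ckpt-{i}.json"), "w") as f:
                json.dump(state, f)
    return state


def recover(start, steps, directory):
    """Rebuild the state after `steps` steps; also return the steps replayed."""
    saved = [int(n[5:-5]) for n in os.listdir(directory) if n.startswith("ckpt-")]
    usable = [i for i in saved if i <= steps]
    if usable:
        last = max(usable)                         # newest checkpoint not past `steps`
        with open(os.path.join(directory, f"ckpt-{last}.json")) as f:
            state = json.load(f)
    else:
        last, state = 0, list(start)               # no checkpoint: replay from the source
    for _ in range(steps - last):
        state = [x + 1 for x in state]
    return state, steps - last


with tempfile.TemporaryDirectory() as d:
    final = run_iterations([0, 10], steps=10, checkpoint_every=4, directory=d)
    recovered, replayed = recover([0, 10], steps=10, directory=d)

with tempfile.TemporaryDirectory() as empty:
    _, replayed_without = recover([0, 10], steps=10, directory=empty)
```

With checkpoints every four steps only the last two steps are replayed; without any checkpoint the whole lineage is replayed.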
SLIDE 13

PageRank Example
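
The code shown on this slide is not part of the transcript. As a stand-in, here is a plain-Python sketch of the usual iterative PageRank computation, with comments naming the RDD-style operator each step would correspond to. The function name, the damping value 0.85 and the three-page graph are assumptions for illustration, not the lecture's code.

```python
from collections import defaultdict


def pagerank(links, iterations=10, damping=0.85):
    """links: dict mapping each page to its outgoing neighbours."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # "join" links with ranks, then "flatMap" into per-neighbour contributions
        contribs = [
            (dest, ranks[page] / len(dests))
            for page, dests in links.items()
            for dest in dests
        ]
        # "reduceByKey": sum the contributions received by each page
        totals = defaultdict(float)
        for dest, c in contribs:
            totals[dest] += c
        # "mapValues": apply the damping formula to obtain the new ranks
        ranks = {page: (1 - damping) + damping * totals[page] for page in links}
    return ranks


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Each iteration derives a new ranks collection from the previous one, so the lineage grows with the number of iterations.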

SLIDE 14

Questions

  • Which intermediate RDDs are created?
  • Stages? Narrow/wide operators?
  • How to reduce shuffling?
  • Lineage graph
SLIDE 15

Lineage Graph