

  1. Spark • Marco Serafini • COMPSCI 532, Lecture 4

  2. Goals • Support for iterative jobs • Reuse of intermediate results without writing to disk • Lineage-based fault tolerance • Does not require checkpointing all intermediate results

  3. Resilient Distributed Datasets (RDDs) • Collection of records (serialized objects) • Read-only • Created through deterministic transformations from • Data in storage • Other RDDs • Lineage of an RDD • Sequence of transformations that created it • Replayed (in parallel) from persisted data to recreate a lost RDD • Caching an RDD: keeping it in memory for later reuse
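
A minimal sketch of these ideas in the Scala RDD API (not from the lecture; the input path is a placeholder and an existing SparkContext `sc` is assumed):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch only: the path is a placeholder.
    def rddBasics(sc: SparkContext): Unit = {
      // RDD created from data in storage
      val lines: RDD[String] = sc.textFile("hdfs://namenode/logs.txt")

      // RDDs created through deterministic transformations of other RDDs;
      // the chain textFile -> filter -> map is the lineage of `fields`
      val errors = lines.filter(_.contains("ERROR"))
      val fields = errors.map(_.split("\t"))

      // Caching: keep the RDD in memory for later reuse
      errors.cache()

      println(errors.count())  // first action: computed from storage
      println(errors.count())  // second action: served from the cached partitions
      println(fields.count())
    }

If a worker holding cached partitions of `errors` fails, Spark replays the recorded filter over the corresponding input partitions to rebuild only the lost pieces.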

  4. Spark Terminology • Driver • Process executing the application code • Sends RDD transformations and actions to workers • Workers • Host partitions of RDDs • Execute transformations
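
A rough sketch of this split (assuming an existing SparkContext `sc`): the program body runs in the driver, while the function passed to `map` is executed by the workers that host the RDD partitions.

    import org.apache.spark.SparkContext

    def driverWorkerDemo(sc: SparkContext): Unit = {
      // Runs in the driver; the data is split into 4 partitions hosted by workers
      val data = sc.parallelize(1 to 100, numSlices = 4)

      // The closure x => x * 2 is shipped to the workers, which apply it
      // to their local partitions
      val doubled = data.map(x => x * 2)

      // collect() is an action: the workers compute, results come back to the driver
      val result = doubled.collect()
      println(result.sum)  // runs in the driver process
    }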

  5. More Terminology • Task • Unit of work that will be sent to one executor • Partition of a fundamental operator, such as map or reduce • Stage • Set of parallel tasks, one task per partition • Can include multiple pipelined operators with no shuffling • Job • Parallel computation consisting of multiple stages • Spawned in response to an action
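
As a sketch of how these terms map onto code (hypothetical input file, `sc` assumed): the two `map`s below are pipelined into the tasks of one stage, the shuffle behind `reduceByKey` starts a second stage, and each action spawns its own job.

    import org.apache.spark.SparkContext

    // Sketch only: "ratings.csv" stands for a file of "user,rating" lines.
    def stagesAndJobs(sc: SparkContext): Unit = {
      val sums = sc.textFile("ratings.csv")        // stage 1: read the input ...
        .map(_.split(","))                         // ... pipelined into the same tasks
        .map(f => (f(0), f(1).toDouble))           // ... still stage 1
        .reduceByKey(_ + _)                        // shuffle: stage 2 starts here

      sums.count()    // action: spawns job 1 (runs stages 1 and 2)
      sums.collect()  // action: spawns job 2
    }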

  6. More Terminology • Shuffle • Data transfer among workers • Partition • Chunk of an RDD, processed by one worker thread • Transformation • Function that produces a new RDD but no output • Lazily evaluated • Action • Function that returns output • Triggers evaluation
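
The transformation/action distinction in a few lines (assuming `sc`): nothing executes while the transformations are declared; the action at the end triggers evaluation of the whole chain.

    import org.apache.spark.SparkContext

    def lazyDemo(sc: SparkContext): Unit = {
      val nums    = sc.parallelize(1 to 1000000)
      val squares = nums.map(x => x.toLong * x)  // transformation: nothing runs yet
      val evens   = squares.filter(_ % 2 == 0)   // transformation: still nothing runs
      val n       = evens.count()                // action: the whole chain executes now
      println(n)
    }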

  7. Spark API

  8. Spark Computation • Driver executes the application code • Workers execute only transformations • Driver sends closures to workers • Lazy evaluation • Driver records transformations without executing them • It builds a Directed Acyclic Graph (DAG) of transformations • Transformations execute only when output (an action) is required
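
`RDD.toDebugString` makes the recorded DAG visible, which is a handy way to see lazy evaluation in practice; the word-count pipeline below is my own illustration (paths are placeholders, `sc` assumed).

    import org.apache.spark.SparkContext

    def dagDemo(sc: SparkContext): Unit = {
      val counts = sc.textFile("input.txt")
        .flatMap(_.split(" "))
        .map(w => (w, 1))
        .reduceByKey(_ + _)

      // So far the driver has only recorded the DAG; this prints it without running anything
      println(counts.toDebugString)

      // The action turns the DAG into stages and executes them
      counts.saveAsTextFile("wordcounts")
    }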

  9. Closures • Driver serializes functions to be executed by workers • It computes a “closure” and sends it to the workers • Closure includes all objects referenced by the function • Careful with references! Or you might send huge closures
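
A common illustration of the huge-closure pitfall (class and field names are hypothetical): referring to a field of an enclosing object captures the whole object, while copying the field into a local variable keeps the closure small.

    import org.apache.spark.rdd.RDD

    class Analyzer(val hugeModel: Array[Byte], val threshold: Double) extends Serializable {

      // BAD: `threshold` is really `this.threshold`, so the closure captures the whole
      // Analyzer and hugeModel is serialized and shipped with every task
      def badFilter(rdd: RDD[Double]): RDD[Double] =
        rdd.filter(x => x > threshold)

      // BETTER: copy the field into a local val; the closure now captures only a Double
      def goodFilter(rdd: RDD[Double]): RDD[Double] = {
        val t = threshold
        rdd.filter(x => x > t)
      }
    }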

  10. Narrow vs. Wide Operators • Narrow dependency • Executes on data local to the same worker • Can be pipelined locally • Faster recovery (only local re-execution needed) • Example: map → filter • Wide dependency • Requires a shuffle, which marks the end of a stage • Complex recovery (multi-worker) • Example: map → groupByKey (if the RDD is not partitioned by that key)
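
A sketch of both kinds of dependencies (assuming `sc`; the file name is a placeholder): `map` and `filter` are narrow and stay in one stage, while `groupByKey` is wide and shuffles, unless the pair RDD was already partitioned by that key.

    import org.apache.spark.{HashPartitioner, SparkContext}

    // Sketch only: "clicks.tsv" stands for a tab-separated (key, value) file.
    def narrowVsWide(sc: SparkContext): Unit = {
      val pairs = sc.textFile("clicks.tsv")
        .map(_.split("\t"))           // narrow: pipelined with the read
        .filter(_.length >= 2)        // narrow: same stage, same tasks
        .map(f => (f(0), f(1)))

      // Wide: all values of a key must meet, so groupByKey shuffles and ends the stage
      pairs.groupByKey().count()

      // If the RDD is already hash-partitioned by the key, groupByKey reuses that
      // partitioning and needs no further shuffle
      val prePartitioned = pairs.partitionBy(new HashPartitioner(8)).persist()
      prePartitioned.groupByKey().count()
    }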

  11. Checkpoints and Partitioning • Partitioning function declared when an RDD is created • Checkpointing • Speeds up recovery by avoiding replay of a long lineage • When to checkpoint is left to the user • Checkpoints are stored (replicated) on a file system
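
A minimal sketch of both mechanisms (assuming `sc`; paths and partition count are placeholders): the partitioner is declared when the pair RDD is built, and checkpointing is requested explicitly by the user and written to the configured file system.

    import org.apache.spark.{HashPartitioner, SparkContext}

    def checkpointDemo(sc: SparkContext): Unit = {
      sc.setCheckpointDir("hdfs://namenode/spark-checkpoints")

      // Partitioning function declared when the RDD is created
      val links = sc.textFile("links.txt")
        .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
        .partitionBy(new HashPartitioner(16))
        .persist()

      // The user decides when to checkpoint; the RDD is written (replicated) to the
      // checkpoint directory, so recovery no longer needs to replay its lineage
      links.checkpoint()
      links.count()  // action: materializes the RDD and writes the checkpoint
    }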

  12. PageRank Example
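
A sketch of PageRank on RDDs, roughly following the version used in the Spark examples and the RDD paper (input file, partition count, and iteration count are placeholders); it is a useful target for the questions on the next slide.

    import org.apache.spark.{HashPartitioner, SparkContext}

    // "links.txt" stands for a file of "source target" pairs, one edge per line.
    def pageRank(sc: SparkContext, iters: Int = 10): Unit = {
      // Adjacency lists, hash-partitioned by page so the join below reuses the partitioning
      val links = sc.textFile("links.txt")
        .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
        .distinct()
        .groupByKey()
        .partitionBy(new HashPartitioner(16))
        .persist()

      var ranks = links.mapValues(_ => 1.0)

      for (_ <- 1 to iters) {
        // Each page spreads its current rank evenly over its outgoing links ...
        val contribs = links.join(ranks).flatMap {
          case (_, (neighbors, rank)) => neighbors.map(dest => (dest, rank / neighbors.size))
        }
        // ... and new ranks are a damped sum of the incoming contributions
        ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
      }

      ranks.saveAsTextFile("pageranks")
    }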

  13. Questions • Which intermediate RDDs are created? • Stages? Narrow/wide operators? • How to reduce shuffling? • Lineage graph

  14. Lineage Graph
