
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing



  1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Presentation by Zbigniew Chlebicki, based on the paper by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica; University of California, Berkeley. Some images and code samples are from the paper, the NSDI presentation, or the Spark project website (http://spark-project.org/).

  2. MapReduce in Hadoop

  3. Resilient Distributed Datasets (RDD)
     ● Immutable, partitioned collection of records
     ● Created by deterministic coarse-grained transformations
     ● Materialized on action
     ● Fault-tolerant through lineage
     ● Controllable persistence and partitioning (see the sketch below)
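A minimal sketch of the last two bullets, assuming `spark` is the SparkContext as in the slides below: transformations only record lineage, an action forces materialization, and the user controls both partitioning and persistence. The key extraction, partition count, and package names (modern org.apache.spark rather than the paper-era bare spark package) are assumptions.

     import org.apache.spark.HashPartitioner
     import org.apache.spark.storage.StorageLevel

     // Transformations: record lineage only, nothing executes yet
     val pairs = spark.textFile("hdfs://...")
       .map(line => (line.split("\t")(0), 1))   // hypothetical key extraction
       .partitionBy(new HashPartitioner(8))     // controllable partitioning
       .persist(StorageLevel.MEMORY_ONLY)       // controllable persistence

     // Action: materializes the partitions; if one is later lost,
     // it is recomputed from the lineage recorded above
     pairs.count()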

  4. Example: Log mining

     val file = spark.textFile("hdfs://...")
     val errors = file.filter(line => line.contains("ERROR")).cache()

     // Count all the errors
     errors.count()

     // Count errors mentioning MySQL
     errors.filter(line => line.contains("MySQL")).count()

     // Fetch the MySQL errors as an array of strings
     errors.filter(line => line.contains("MySQL")).collect()

  5. Example: Logistic Regression

     val points = spark.textFile(...).map(parsePoint).cache()
     var w = Vector.random(D)  // current separating plane
     for (i <- 1 to ITERATIONS) {
       val gradient = points.map(p =>
         (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
       ).reduce(_ + _)
       w -= gradient
     }
     println("Final separating plane: " + w)
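The slide assumes a parsePoint helper and a Vector type (paper-era spark.util.Vector, which provides dot, element-wise arithmetic, and Vector.random) without showing them. A hypothetical sketch, assuming each input line holds a label followed by D space-separated features:

     import spark.util.Vector

     // Hypothetical labeled point: feature vector x, label y
     case class Point(x: Vector, y: Double)

     // Assumed input format: "y x1 x2 ... xD", whitespace-separated
     def parsePoint(line: String): Point = {
       val nums = line.split(' ').map(_.toDouble)
       Point(new Vector(nums.drop(1)), nums(0))
     }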

  6. Example: PageRank

     links = // RDD of (url, neighbors) pairs
     ranks = // RDD of (url, rank) pairs
     for (i <- 1 to ITERATIONS) {
       ranks = links.join(ranks).flatMap {
         case (url, (links, rank)) =>
           links.map(dest => (dest, rank / links.size))
       }.reduceByKey(_ + _)
     }
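The paper notes that this join need not shuffle the large, static links dataset if links is hash-partitioned once and persisted, and ranks is derived with the same partitioning. A sketch under those assumptions (parseNeighbors, the input path, and the partition count are placeholders):

     val links = spark.textFile("hdfs://...")
       .map(parseNeighbors)                   // hypothetical parser -> (url, neighbors)
       .partitionBy(new HashPartitioner(100)) // co-locate each url's data once
       .persist()
     var ranks = links.mapValues(_ => 1.0)    // mapValues keeps the partitioner, so
                                              // each iteration's join is local to links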

  7. Representation

     abstract def compute(split: Split): Iterator[T]
     abstract val dependencies: List[spark.Dependency[_]]
     abstract def splits: Array[Split]
     val partitioner: Option[Partitioner]
     def preferredLocations(split: Split): Seq[String]
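These five members are the paper's entire representation interface. A self-contained sketch under illustrative names (MyRDD and friends are not Spark's real internals), showing filter() as a narrow, one-to-one transformation that preserves its parent's partitioning:

     trait Split { def index: Int }
     trait Partitioner
     class Dependency[T](val rdd: MyRDD[T])
     class OneToOneDependency[T](rdd: MyRDD[T]) extends Dependency[T](rdd)

     abstract class MyRDD[T] {
       def splits: Array[Split]                      // list of partitions
       def dependencies: List[Dependency[_]]         // lineage edges to parents
       def compute(split: Split): Iterator[T]        // derive one partition
       def partitioner: Option[Partitioner] = None   // how keys are placed, if any
       def preferredLocations(split: Split): Seq[String] = Nil  // locality hints
     }

     // filter() as a narrow transformation: one output split per parent split
     class FilteredRDD[T](parent: MyRDD[T], f: T => Boolean) extends MyRDD[T] {
       def splits = parent.splits
       def dependencies = List(new OneToOneDependency(parent))
       def compute(split: Split) = parent.compute(split).filter(f)
       override def partitioner = parent.partitioner // filtering preserves partitioning
     }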

  8. Scheduling

  9. Evaluation: PageRank

  10. Scalability

  11. Fault Recovery (k-means)

  12. Behavior with Insufficient RAM (logistic regression)

  13. User Applications
     ● Conviva: data mining (40x speedup)
     ● Mobile Millennium: traffic modeling
     ● Twitter: spam classification
     ● ...

  14. Expressing Other Models
     ● MapReduce, DryadLINQ (see the word-count sketch below)
     ● Pregel graph processing
     ● Iterative MapReduce
     ● SQL
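As a sketch of the first bullet, a MapReduce job such as word count maps directly onto RDD operators: the map phase becomes flatMap and the reduce phase becomes reduceByKey, which also plays the role of a combiner. The input path is a placeholder:

     val counts = spark.textFile("hdfs://...")
       .flatMap(line => line.split(" "))   // "map" phase: emit words
       .map(word => (word, 1))
       .reduceByKey(_ + _)                 // "reduce" phase, with combining
     counts.collect()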

  15. Conclusion
     ● RDDs are an efficient, general, and fault-tolerant abstraction for cluster computing
     ● Up to 20x faster than Hadoop for memory-bound applications
     ● Can be used for interactive data mining
     ● Available as open source at http://spark-project.org
