Making Big Data Processing Simple with Spark


  1. Making Big Data Processing Simple with Spark
     Matei Zaharia, December 17, 2015

  2. What is Apache Spark?
     A fast and general cluster computing engine that generalizes the MapReduce
     model, making it easy and fast to process large datasets.
     • High-level APIs in Java, Scala, Python, and R
     • Unified engine that can capture many workloads

  3. A Unified Engine
     Libraries on a common core: Spark SQL (structured data), Spark Streaming
     (real-time), MLlib (machine learning), and GraphX (graph processing), all
     running on Spark.

  4. A Large Community
     [Chart: contributors per month to Spark, rising steadily from 2010 to
     2015.] Most active open source project for big data.

  5. Overview
     • Why a unified engine?
     • Spark programming model
     • Built-in libraries
     • Applications

  6. History: Cluster Computing (2004)

  7. MapReduce: a general engine for batch processing
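     The model itself is small; a minimal word-count sketch, written here in
     PySpark for consistency with the rest of the deck, assuming a SparkContext
     `sc` and placeholder paths:

        # Map each line to (word, 1) pairs, then reduce by key to sum the
        # counts: the classic MapReduce pattern.
        counts = (sc.textFile("hdfs://.../input")
                    .flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
        counts.saveAsTextFile("hdfs://.../counts")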

  8. Beyond MapReduce
     MapReduce was great for batch processing, but users quickly needed to do more:
     • More complex, multi-pass algorithms
     • More interactive ad-hoc queries
     • More real-time stream processing
     Result: specialized systems for these workloads.

  9. Big Data Systems Today
     General batch processing (MapReduce) alongside specialized systems for new
     workloads: Pregel, Giraph, Dremel, Drill, Presto, Impala, Storm, S4, ...

  10. Problems with Specialized Systems
      • More systems to manage, tune, and deploy
      • Can't easily combine processing types, even though most applications
        need to (e.g., load data with SQL, then run machine learning)
      • In many cases, data transfer between engines is a dominant cost!

  11. Big Data Systems Today
      Where does a unified engine fit between general batch processing
      (MapReduce) and the specialized systems for new workloads (Pregel,
      Giraph, Dremel, Drill, Presto, Impala, Storm, S4, ...)?

  12. Overview
      • Why a unified engine?
      • Spark programming model
      • Built-in libraries
      • Applications

  13. Background
      Recall that 3 workloads were issues for MapReduce:
      • More complex, multi-pass algorithms
      • More interactive ad-hoc queries
      • More real-time stream processing
      While these look different, all 3 need one thing that MapReduce lacks:
      efficient data sharing.

  14. Data Sharing in MapReduce
      [Diagram: each iteration (iter. 1, iter. 2, ...) writes its result to
      HDFS and the next iteration reads it back; each ad-hoc query (query 1,
      2, 3) re-reads the input from HDFS.] Slow due to replication and disk I/O.

  15. What We'd Like
      [Diagram: one-time processing loads the input into distributed memory;
      iterations and queries then share the data in memory.] Memory is 10-100x
      faster than network and disk.

  16. Spark Programming Model
      Resilient Distributed Datasets (RDDs):
      • Collections of objects stored in RAM or on disk across a cluster
      • Built via parallel transformations (map, filter, ...)
      • Automatically rebuilt on failure
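      A minimal sketch of the model, assuming a SparkContext `sc`:

         # RDDs are built lazily through parallel transformations; nothing
         # executes until an action (here, count) runs, and lost partitions
         # are rebuilt automatically from this chain of transformations.
         nums = sc.parallelize(range(1, 1000001))       # base RDD
         squares = nums.map(lambda x: x * x)            # transformation (lazy)
         evens = squares.filter(lambda x: x % 2 == 0)   # transformation (lazy)
         print(evens.count())                           # action: runs the job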

  17. Example: Log Mining
      Load error messages from a log into memory, then interactively search
      for various patterns.

         lines = spark.textFile("hdfs://...")                    # base RDD
         errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
         messages = errors.map(lambda s: s.split('\t')[2])
         messages.cache()

         messages.filter(lambda s: "MySQL" in s).count()         # action
         messages.filter(lambda s: "Redis" in s).count()
         . . .

      [Diagram: the driver sends tasks to workers; each worker reads one block
      of the file and caches its partition of `messages` in memory, returning
      results to the driver.] Example: full-text search of Wikipedia in 0.5 sec
      (vs 20 s for on-disk data).

  18-19. Fault Tolerance
      RDDs track lineage info to rebuild lost data:

         file.map(lambda rec: (rec.type, 1)) \
             .reduceByKey(lambda x, y: x + y) \
             .filter(lambda pair: pair[1] > 10)

      [Diagram: lineage graph from the input file through map, reduce, and
      filter; a lost partition is recomputed by re-running those steps on its
      inputs.]
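      The lineage is inspectable; a small sketch, assuming `file` is an RDD of
      records with a `type` field as above:

         # Spark records how each RDD was derived from its parents; that
         # lineage is what lets it recompute lost partitions after a failure.
         counts = (file.map(lambda rec: (rec.type, 1))
                       .reduceByKey(lambda x, y: x + y)
                       .filter(lambda pair: pair[1] > 10))
         print(counts.toDebugString())  # textual lineage graph (bytes in PySpark)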

  20. Example: Logistic Regression
      [Chart: running time vs number of iterations (1, 5, 10, 20, 30) for
      Hadoop and Spark.] Hadoop: 110 s per iteration. Spark: 80 s for the
      first iteration, 1 s for further iterations.
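      A sketch of the kind of multi-pass algorithm behind this result (not the
      exact benchmark code): batch gradient descent for logistic regression
      over a cached RDD. The data format, feature count, and iteration count
      here are assumptions.

         import numpy as np

         D = 10           # number of features (assumed)
         ITERATIONS = 10

         def parse_point(line):
             # hypothetical format: label then D feature values per line
             vals = [float(v) for v in line.split()]
             return (np.array(vals[1:]), vals[0])  # (features x, label y in {-1, +1})

         # Caching is the point: every pass after the first reads from RAM
         # instead of re-scanning HDFS (the 80 s -> 1 s effect above).
         points = sc.textFile("hdfs://.../data").map(parse_point).cache()

         w = np.zeros(D)
         for _ in range(ITERATIONS):
             # one full pass over the cached data computes the batch gradient
             grad = points.map(
                 lambda p: p[0] * p[1] * (1.0 / (1.0 + np.exp(-p[1] * p[0].dot(w))) - 1.0)
             ).reduce(lambda a, b: a + b)
             w -= grad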

  21. On-Disk Performance
      Time to sort 100 TB (Daytona GraySort benchmark, sortbenchmark.org):
      • 2013 record: Hadoop, 2100 machines, 72 minutes
      • 2014 record: Spark, 207 machines, 23 minutes

  22. Libraries Built on Spark
      Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine
      learning), and GraphX (graph processing), all built on the Spark core.

  23. Combining Processing Types

         # Load data using SQL
         points = ctx.sql("select latitude, longitude from tweets")

         # Train a machine learning model
         model = KMeans.train(points, 10)

         # Apply it to a stream
         sc.twitterStream(...) \
           .map(lambda t: (model.predict(t.location), 1)) \
           .reduceByWindow("5s", lambda a, b: a + b)
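      The slide's code is schematic (PySpark has no built-in twitterStream).
      A more concrete sketch of the same pipeline, with a socket stream of
      "lat lon" lines standing in for the tweet stream, and a SQLContext `ctx`
      with a registered "tweets" table assumed:

         from pyspark.mllib.clustering import KMeans
         from pyspark.streaming import StreamingContext

         # 1. Load training points with SQL.
         points = (ctx.sql("select latitude, longitude from tweets")
                      .rdd.map(lambda row: [row.latitude, row.longitude]))

         # 2. Train a k-means model with 10 clusters on the same engine.
         model = KMeans.train(points, 10)

         # 3. Score a live stream in 5-second windows.
         ssc = StreamingContext(sc, 1)                  # 1-second batches
         stream = ssc.socketTextStream("localhost", 9999)
         (stream.map(lambda line: [float(v) for v in line.split()])
                .map(lambda loc: (model.predict(loc), 1))
                .reduceByKeyAndWindow(lambda a, b: a + b, None, 5)
                .pprint())
         ssc.start()
         ssc.awaitTermination()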

  24. Combining Processing Types
      Separate systems: HDFS read -> ETL -> HDFS write -> HDFS read -> train ->
      HDFS write -> HDFS read -> query -> HDFS write -> ...
      Spark: HDFS read -> ETL -> train -> query -> HDFS write

  25. Performance vs Specialized Systems
      [Three charts: SQL response time in seconds for Hive, Impala (disk),
      Impala (mem), Spark (disk), and Spark (mem); streaming throughput in
      MB/s/node for Storm and Spark; ML response time in minutes for Mahout,
      GraphLab, and Spark.]

  26. Some Recent Additions
      • DataFrame API (similar to R and Pandas): an easy programmatic way to
        work with structured data
      • R interface (SparkR)
      • Machine learning pipelines (like scikit-learn)
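      A small DataFrame sketch, assuming a SQLContext `ctx` and a placeholder
      JSON file of user records with `name` and `age` fields:

         # DataFrame transformations are declarative and run through the
         # Spark SQL optimizer, much like R or Pandas data frame operations.
         df = ctx.read.json("hdfs://.../users.json")
         (df.filter(df.age > 21)
            .groupBy("age")
            .count()
            .show())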

  27. Overview
      • Why a unified engine?
      • Spark programming model
      • Built-in libraries
      • Applications

  28. Spark Community
      Over 1000 deployments, with clusters of up to 8000 nodes. Many talks
      online at spark-summit.org.

  29. Top Applications
      • Business intelligence: 68%
      • Data warehousing: 52%
      • Recommendation: 44%
      • Log processing: 40%
      • User-facing services: 36%
      • Fraud detection / security: 29%

  30. Spark Components Used
      • Spark SQL: 69%
      • DataFrames: 62%
      • Spark Streaming: 58%
      • MLlib + GraphX: 58%
      75% of users use more than one component.

  31. Learn More
      • Get started on your laptop: spark.apache.org
      • Resources and MOOCs: sparkhub.databricks.com
      • Spark Summit: spark-summit.org

  32. Thank You
