
Apache Spark: A Unified Engine for Big Data Processing


  1. Apache Spark: A Unified Engine for Big Data Processing Presented by: Huanyi Chen

  2. Apache Spark: A Unified Engine for Big Data Processing
     - Engine?
     - Unified?

  3. Apache Spark: A Unified Engine for Big Data Processing
     - Engine?
       - Converts one form of data into other useful forms
     - Unified?
       - Supports multiple types of conversions

  4. Apache Spark: A Unified Engine for Big Data Processing
     - What is Apache Spark? (Engine)
     - How can it make multiple types of conversions over big data? (Unified)

  5. What is Apache Spark?
     - A framework like MapReduce
     - Resilient Distributed Datasets (RDDs)

  6. Resilient Distributed Datasets (RDDs)

  7. Resilient Distributed Datasets (RDDs)

  8. Resilient Distributed Datasets (RDDs) (figure labels: I/O, I/O)

  9. Resilient Distributed Datasets (RDDs)

  10. Resilient Distributed Datasets (RDDs)
      An RDD is a read-only, partitioned collection of records.
      - Transformations
        - Create RDDs (map, filter, join, etc.)
      - Actions
        - Return a value to the application
        - Or export data to a storage system
      - Persistence
        - Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage).
      - Partitioning
        - Users can ask that an RDD's elements be partitioned across machines based on a key in each record.
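The split between lazy transformations and eager actions can be illustrated with a small single-machine toy. This is only a sketch in plain Python, not Spark's API: the `ToyRDD` class, its methods, and the `square` helper are hypothetical names invented for illustration, and real RDDs are distributed across machines.

```python
class ToyRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute_partitions):
        # compute_partitions: zero-argument function returning a list of partitions.
        self._compute = compute_partitions
        self._persist = False
        self._cache = None

    @staticmethod
    def parallelize(data, num_partitions=2):
        data = list(data)
        parts = [data[i::num_partitions] for i in range(num_partitions)]
        return ToyRDD(lambda: parts)

    def _partitions(self):
        if self._cache is not None:
            return self._cache
        parts = self._compute()
        if self._persist:
            self._cache = parts  # keep the computed partitions in memory for reuse
        return parts

    # Transformations: build a new ToyRDD and do no work yet.
    def map(self, f):
        return ToyRDD(lambda: [[f(x) for x in p] for p in self._partitions()])

    def filter(self, pred):
        return ToyRDD(lambda: [[x for x in p if pred(x)] for p in self._partitions()])

    def persist(self):
        self._persist = True
        return self

    # Actions: trigger the computation and return a value to the caller.
    def collect(self):
        return [x for p in self._partitions() for x in p]

    def count(self):
        return sum(len(p) for p in self._partitions())


work_log = []  # records every element the map function actually processes


def square(x):
    work_log.append(x)
    return x * x


numbers = ToyRDD.parallelize(range(6), num_partitions=2)
squares = numbers.map(square).persist()
even_squares = squares.filter(lambda x: x % 2 == 0)

evaluated_before_action = len(work_log)  # transformations alone did no work
result = even_squares.collect()          # first action runs the pipeline
evaluated_after_collect = len(work_log)
total = squares.count()                  # served from the persisted partitions
evaluated_after_count = len(work_log)
```

In this toy, `square` runs only when `collect()` is called, and the later `count()` reuses the persisted partitions instead of recomputing them.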

  11. Resilient Distributed Datasets (RDDs)

  12. Resilient Distributed Datasets (RDDs)
      - Lineage
        - An RDD has enough information about how it was derived from other datasets to be recomputed.
      - Narrow dependencies
        - Each partition of the parent RDD is used by at most one partition of the child RDD
      - Wide dependencies
        - Multiple child partitions may depend on each partition of the parent RDD
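The narrow/wide distinction can be checked mechanically once the partition-level dependencies are written down. The sketch below is a plain-Python illustration, not part of Spark: `classify_dependency` and the three example mappings are hypothetical, and each mapping goes from a child partition index to the parent partition indices it reads.

```python
from collections import Counter


def classify_dependency(child_to_parents):
    """Return "narrow" if every parent partition feeds at most one child partition."""
    uses = Counter(
        parent
        for parents in child_to_parents.values()
        for parent in set(parents)
    )
    return "narrow" if all(n <= 1 for n in uses.values()) else "wide"


# map-style: child partition i reads only parent partition i.
map_dependency = {0: [0], 1: [1], 2: [2]}

# coalesce-style: a child merges several parents, but no parent is shared.
coalesce_dependency = {0: [0, 1], 1: [2]}

# groupByKey-style shuffle: every child reads from every parent.
shuffle_dependency = {0: [0, 1, 2], 1: [0, 1, 2]}

map_kind = classify_dependency(map_dependency)
coalesce_kind = classify_dependency(coalesce_dependency)
shuffle_kind = classify_dependency(shuffle_dependency)
```

Under this definition the merge case is still narrow, because a lost child partition can be rebuilt from a fixed, unshared set of parent partitions, whereas the shuffle case needs data from every parent.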

  13. Resilient Distributed Datasets (RDDs)

  14. Resilient Distributed Datasets (RDDs)
      Example: run an action on RDD G

  15. MapReduce vs Spark (figure panels: MapReduce Ecosystem, Spark Ecosystem)

  16. Higher-Level Libraries

  17. SQL and DataFrames

  18. SQL and DataFrames
      - DataFrames = RDDs + Schema = Tables
      - Spark SQL's DataFrame API supports inline definition of user-defined functions (UDFs), without the complicated packaging and registration process found in other database systems.

  19. UDF in MySQL

  20. UDF in Spark SQL

  21. Spark Streaming

  22. Spark Streaming
      - Discretized stream processing model
      - Continuous operator processing model
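The discretized stream model treats a stream as a sequence of small batches, each processed with an ordinary batch computation. The following is a minimal plain-Python sketch of that idea, not Spark Streaming's API: `discretize`, the sample `events`, and the one-second `batch_interval` are assumptions made for illustration, and intervals with no events are simply skipped here.

```python
from collections import Counter, defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    # Return batches in time order; empty intervals are omitted in this toy.
    return [batches[i] for i in sorted(batches)]


# Hypothetical input: (arrival time in seconds, word).
events = [
    (0.2, "spark"),
    (0.7, "rdd"),
    (1.1, "spark"),
    (1.9, "spark"),
    (2.5, "rdd"),
]

micro_batches = discretize(events, batch_interval=1.0)

# The same batch computation (a word count) is applied to every micro-batch.
counts_per_batch = [dict(Counter(batch)) for batch in micro_batches]
```

A continuous operator model would instead keep long-running operators that update state record by record; the sketch only covers the discretized side.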

  23. GraphX

  24. GraphX

  25. GraphX

  26. GraphX

  27. GraphX

  28. GraphX
      - On its own, it does not beat specialized graph-parallel systems
      - But it outperforms them over a full graph analytics pipeline

  29. MLlib

  30. MLlib
      - More than 50 common algorithms for distributed model training
      - Supports pipeline construction on Spark
      - Integrates well with other Spark libraries

  31. Why use Apache Spark?
      - Ecosystem
      - Competitive performance
      - Low cost of sharing data
      - Low latency of MapReduce steps
      - Control over bottleneck resources

  32. Apache Spark in 2016
      - Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
      - Between 2010 and 2016, Apache Spark grew to 1,000 contributors and thousands of deployments.

  33. Apache Spark Today

  34. Apache Spark: A Unified Engine for Big Data Processing
      - What is Apache Spark?
        - Apache Spark = MapReduce + RDDs
      - How can it make multiple types of conversions over big data?
        - Higher-level libraries enable Apache Spark to handle different types of big data workloads

  35. “Try Apache Spark if you are new to the big data processing world.”
      Huanyi Chen

  36. Q&A
      - What issues can persisting data in memory cause? For example, garbage collection?
      - What are the Parallel Random Access Machine (PRAM) model and the Bulk Synchronous Parallel (BSP) model? Can these two models express any computation in the distributed world?
      - Will optimizing one library cause other libraries to lose performance?
      - Is using memory as storage really the next generation of storage?
