Apache Spark: A Unified Engine for Big Data Processing
Presented by: Huanyi Chen
Engine?
§ Converts one form of data into other useful forms
Unified?
§ Supports multiple types of conversions
§ A framework like MapReduce
§ Resilient Distributed Datasets (RDDs)
RDDs
§ An RDD is a read-only, partitioned collection of records
§ Transformations
  § create RDDs (map, filter, join, etc.)
§ Actions
  § return a value to the application
  § or export data to a storage system
§ Persistence
  § Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage).
§ Partitioning
  § Users can ask that an RDD’s elements be partitioned across machines based on a key in each record.
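The four ideas above (transformations, actions, persistence, partitioning) can be sketched with a toy, single-process model in plain Python. This is not Spark's API: `ToyRDD`, `from_list`, `partition_by_key`, and the demo data are hypothetical names chosen only for illustration. The sketch assumes that transformations are lazy and that work happens only when an action runs.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a partitioned collection that is never mutated."""

    def __init__(self, compute, num_partitions):
        self._compute = compute            # () -> list of partitions (lists)
        self.num_partitions = num_partitions
        self._persist = False              # set by persist()
        self._cache = None                 # filled by the first action after persist()

    @classmethod
    def from_list(cls, data, num_partitions=2):
        # Round-robin split of the input into partitions.
        parts = [list(data[i::num_partitions]) for i in range(num_partitions)]
        return cls(lambda: parts, num_partitions)

    def _partitions(self):
        if self._cache is not None:
            return self._cache
        parts = self._compute()
        if self._persist:
            self._cache = parts
        return parts

    # Transformations: return a new ToyRDD, do no work yet.
    def map(self, f):
        return ToyRDD(lambda: [[f(x) for x in p] for p in self._partitions()],
                      self.num_partitions)

    def filter(self, pred):
        return ToyRDD(lambda: [[x for x in p if pred(x)] for p in self._partitions()],
                      self.num_partitions)

    # Partitioning: place (key, value) records by a hash of the key.
    def partition_by_key(self, n):
        def compute():
            out = [[] for _ in range(n)]
            for p in self._partitions():
                for key, value in p:
                    out[hash(key) % n].append((key, value))
            return out
        return ToyRDD(compute, n)

    # Persistence: mark this dataset for in-memory reuse.
    def persist(self):
        self._persist = True
        return self

    # Actions: trigger computation and return a value to the caller.
    def collect(self):
        return [x for p in self._partitions() for x in p]

    def count(self):
        return sum(len(p) for p in self._partitions())


calls = []  # records every time the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

squares = ToyRDD.from_list([1, 2, 3, 4]).map(square).persist()
evens = squares.filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # transformations alone did no work

n_even = evens.count()             # first action: computes and caches squares
all_squares = squares.collect()    # second action: served from the cache

pairs = ToyRDD.from_list([("a", 1), ("b", 2), ("a", 3)]).partition_by_key(2)
```

In this sketch `square` runs only four times even though two actions touch `squares`, which is the point of marking a reused dataset as persistent.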
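§ An RDD has enough information about how it was derived from other datasets (its lineage) to recompute lost partitions
§ Narrow dependencies:
  § each partition of the parent RDD is used by at most one partition of the child RDD
§ Wide dependencies:
  § multiple child partitions may depend on each partition of the parent RDD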
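The distinction between the two kinds of dependencies can be checked mechanically. The RDD paper calls a dependency narrow when each parent partition is used by at most one child partition, and wide otherwise. The sketch below is a minimal illustration in plain Python, not Spark code; the dictionary layout and the function names `classify_dependency` and `parents_needed` are assumptions made for this example.

```python
def classify_dependency(parent_to_children):
    """Classify a dependency from a map: parent partition -> set of child partitions that read it."""
    if all(len(children) <= 1 for children in parent_to_children.values()):
        return "narrow"
    return "wide"


def parents_needed(parent_to_children, lost_child):
    """Parent partitions that must be read again to rebuild one lost child partition."""
    return sorted(p for p, children in parent_to_children.items()
                  if lost_child in children)


# A map-style transformation: child partition i is built from parent partition i only.
map_dep = {0: {0}, 1: {1}, 2: {2}}

# A shuffle-style transformation: every parent partition feeds every child partition.
shuffle_dep = {p: {0, 1} for p in range(3)}

map_kind = classify_dependency(map_dep)              # "narrow"
shuffle_kind = classify_dependency(shuffle_dep)      # "wide"
map_recovery = parents_needed(map_dep, 1)            # [1]
shuffle_recovery = parents_needed(shuffle_dep, 1)    # [0, 1, 2]
```

The recovery lists show why the distinction matters for lineage-based fault tolerance: rebuilding a lost partition behind a narrow dependency touches one parent partition, while a wide dependency may require all of them.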
Example: run an action on RDD G
Continuous operator processing model
Discretized stream processing model
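As a rough illustration of the discretized model, the sketch below chops a stream of timestamped events into short intervals and runs an ordinary batch computation on each interval. It is plain Python under stated assumptions, not Spark Streaming's API: `discretize`, `running_counts`, the one-second interval, and the sample events are all hypothetical.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into batches, one per time interval.

    Intervals with no events are skipped in this sketch.
    """
    batches = {}
    for timestamp, value in events:
        index = int(timestamp // batch_interval)
        batches.setdefault(index, []).append(value)
    return [batches[i] for i in sorted(batches)]


def running_counts(batches):
    """Run a small batch job per interval, carrying state (a total) between batches."""
    total = 0
    totals = []
    for batch in batches:
        total += len(batch)   # the per-interval batch computation
        totals.append(total)
    return totals


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = discretize(events, batch_interval=1.0)
totals = running_counts(batches)
```

A continuous operator model would instead keep long-lived operators that update their state on every incoming record, rather than treating each interval as a separate small batch.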
§ Ecosystem
§ Competitive performance
§ Low cost in sharing data
§ Low latency of MapReduce steps
§ Control over bottleneck resources
§ Apache Spark applications range from finance to scientific data
processing and combine libraries for SQL, machine learning, and graphs.
§ From 2010 to 2016, Apache Spark grew to 1,000 contributors and thousands of
deployments.
§ Apache Spark = MapReduce + RDDs
§ Higher-level libraries enable Apache Spark to handle different types