SLIDE 1

Apache Spark: A Unified Engine for Big Data Processing

Presented by: Huanyi Chen

SLIDE 2

Apache Spark: A Unified Engine for Big Data Processing

§ Engine?
§ Unified?

SLIDE 3

Apache Spark: A Unified Engine for Big Data Processing

§ Engine?
  § Converts one form of data into other useful forms
§ Unified?
  § Supports multiple types of conversions

SLIDE 4

Apache Spark: A Unified Engine for Big Data Processing

§ What is Apache Spark? (Engine)
§ How can it perform multiple types of conversions over big data? (Unified)

SLIDE 5

What is Apache Spark?

§ A framework like MapReduce
§ Resilient Distributed Datasets (RDDs)

SLIDE 6

Resilient Distributed Datasets (RDDs)

SLIDE 7

Resilient Distributed Datasets (RDDs)

SLIDE 8

Resilient Distributed Datasets (RDDs)

SLIDE 9

Resilient Distributed Datasets (RDDs)

SLIDE 10

Resilient Distributed Datasets (RDDs)

§ An RDD is a read-only, partitioned collection of records
§ Transformations
  § create RDDs (map, filter, join, etc.)
§ Actions
  § return a value to the application
  § or export data to a storage system
§ Persistence
  § Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage).
§ Partitioning
  § Users can ask that an RDD’s elements be partitioned across machines based on a key in each record.
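
The four ideas on this slide (lazy transformations, actions, persistence, key-based partitioning) can be illustrated with a small single-process sketch. `MiniRDD` and all of its methods are hypothetical stand-ins written for this illustration; they are not the Spark API and ignore distribution and fault tolerance entirely.

```python
class MiniRDD:
    """Hypothetical, single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, compute, num_partitions):
        self._compute = compute            # function: partition index -> list of records
        self.num_partitions = num_partitions
        self._persist = False
        self._cache = {}                   # used only when persist() was requested

    @classmethod
    def from_partitions(cls, partitions):
        frozen = [tuple(p) for p in partitions]   # source data is read-only
        return cls(lambda i: list(frozen[i]), len(frozen))

    # Transformations: build a new MiniRDD, compute nothing yet.
    def map(self, fn):
        return MiniRDD(lambda i: [fn(x) for x in self._partition(i)],
                       self.num_partitions)

    def filter(self, pred):
        return MiniRDD(lambda i: [x for x in self._partition(i) if pred(x)],
                       self.num_partitions)

    def partition_by(self, num_partitions):
        """Place each (key, value) record in partition hash(key) % num_partitions."""
        def compute(i):
            return [kv for p in range(self.num_partitions)
                    for kv in self._partition(p)
                    if hash(kv[0]) % num_partitions == i]
        return MiniRDD(compute, num_partitions)

    # Persistence: mark this dataset for reuse (here: keep partitions in memory).
    def persist(self):
        self._persist = True
        return self

    def _partition(self, i):
        if not self._persist:
            return self._compute(i)
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    # Actions: trigger computation and return a value to the caller.
    def collect(self):
        return [x for i in range(self.num_partitions) for x in self._partition(i)]

    def count(self):
        return len(self.collect())


calls = []                                 # records every invocation of square()

def square(x):
    calls.append(x)
    return x * x

base = MiniRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
squares = base.map(square).persist()       # transformation + persistence hint
evens = squares.filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)          # still empty: nothing has run yet

even_squares = evens.collect()             # action: runs map and filter
total = squares.count()                    # served from the persisted partitions

pairs = MiniRDD.from_partitions([[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]])
by_key = pairs.partition_by(2).collect()   # integer keys keep hash() deterministic
```

In this sketch `square` runs six times in total even though two actions touch `squares`, because the second action reads the persisted partitions instead of recomputing them.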

SLIDE 11

Resilient Distributed Datasets (RDDs)

SLIDE 12

Resilient Distributed Datasets (RDDs)

§ Lineage
  § An RDD has enough information about how it was derived from other datasets.
§ Narrow dependencies
  § Each partition of the parent RDD is used by at most one partition of the child RDD
§ Wide dependencies
  § Each partition of the parent RDD may be used by multiple child partitions
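
The narrow/wide distinction above can be checked mechanically once a dependency is written down as a mapping from child partitions to the parent partitions they read. The sketch below is illustrative only: `classify_dependency`, `parents_needed`, and the three example mappings are hypothetical names and data, not part of Spark.

```python
from collections import Counter


def classify_dependency(child_to_parents):
    """Return "narrow" if every parent partition feeds at most one child partition.

    child_to_parents: dict mapping a child partition index to the parent
    partition indices it reads from.
    """
    uses = Counter(p for parents in child_to_parents.values() for p in set(parents))
    return "narrow" if all(n <= 1 for n in uses.values()) else "wide"


def parents_needed(child_to_parents, lost_child):
    """Parent partitions that lineage says must be read to rebuild one lost child partition."""
    return sorted(set(child_to_parents[lost_child]))


# Hypothetical dependency shapes for illustration.
map_dep = {0: [0], 1: [1], 2: [2]}            # map/filter style: one-to-one
merge_dep = {0: [0, 1], 1: [2, 3]}            # several parents merged into one child
shuffle_dep = {0: [0, 1, 2], 1: [0, 1, 2]}    # shuffle style: every parent feeds every child

kinds = [classify_dependency(d) for d in (map_dep, merge_dep, shuffle_dep)]
narrow_recovery = parents_needed(map_dep, 1)
wide_recovery = parents_needed(shuffle_dep, 1)
```

Rebuilding a lost child partition under the one-to-one mapping touches a single parent partition, while the shuffle-style mapping needs every parent partition, which is why recovery across wide dependencies is more expensive.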

SLIDE 13

Resilient Distributed Datasets (RDDs)

SLIDE 14

Resilient Distributed Datasets (RDDs)

Example: run an action on RDD G
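
The figure for this example did not survive extraction. As a rough sketch of the idea, a scheduler can pipeline narrow dependencies inside a stage and cut stages at wide dependencies. The code below groups RDDs into stages as connected components over narrow edges, which is a simplification of what a real scheduler does. The DAG and the name `build_stages` are assumptions made for illustration, not a transcription of the slide.

```python
def build_stages(deps):
    """Group RDD names into stages: RDDs linked by narrow dependencies share a stage.

    deps: dict mapping a child RDD name to a list of (parent name, kind) pairs,
    where kind is "narrow" or "wide".
    """
    nodes = set(deps)
    for parents in deps.values():
        nodes.update(name for name, _ in parents)

    root = {n: n for n in nodes}           # union-find forest over RDD names

    def find(n):
        while root[n] != n:
            root[n] = root[root[n]]        # path halving
            n = root[n]
        return n

    for child, parents in deps.items():
        for name, kind in parents:
            if kind == "narrow":           # only narrow edges merge stages
                root[find(name)] = find(child)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Assumed DAG ending in RDD G (hypothetical, for illustration only).
deps = {
    "B": [("A", "wide")],                   # e.g. a groupBy
    "D": [("C", "narrow")],                 # e.g. a map
    "F": [("D", "narrow"), ("E", "narrow")],  # e.g. a union
    "G": [("B", "narrow"), ("F", "wide")],  # e.g. a join with one side pre-partitioned
}
stages = build_stages(deps)
```

With this assumed DAG, an action on G yields three groups: A alone, the pipelined chain through C, D, E and F, and the final group containing B and G.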

SLIDE 15

MapReduce vs Spark

Figure panels: MapReduce Ecosystem; Spark Ecosystem

SLIDE 16

Higher-Level Libraries

SLIDE 17

SQL and DataFrames

SLIDE 18

SQL and DataFrames

§ DataFrames = RDDs + Schema = Tables
§ Spark SQL’s DataFrame API supports inline definition of user-defined functions (UDFs), without the complicated packaging and registration process found in other database systems.
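
Both points on this slide can be sketched in a few lines: a table is a collection of records plus a schema, and a UDF can be an ordinary function passed inline. `MiniDataFrame`, `with_column`, and `select` are hypothetical names for this illustration and are not the Spark SQL API; the sample rows are made up.

```python
class MiniDataFrame:
    """Hypothetical illustration: rows (the record collection) plus column names (the schema)."""

    def __init__(self, rows, schema):
        self.rows = [tuple(r) for r in rows]
        self.schema = list(schema)

    def with_column(self, name, fn, *input_cols):
        """Append a column computed by an inline function over the named input columns."""
        idx = [self.schema.index(c) for c in input_cols]
        new_rows = [r + (fn(*(r[i] for i in idx)),) for r in self.rows]
        return MiniDataFrame(new_rows, self.schema + [name])

    def select(self, *cols):
        """Keep only the named columns, in the given order."""
        idx = [self.schema.index(c) for c in cols]
        return MiniDataFrame([tuple(r[i] for i in idx) for r in self.rows], cols)


# Made-up sample data.
people = MiniDataFrame([("Ada", 36), ("Linus", 17)], ["name", "age"])

# The "UDF" is a plain lambda defined at the call site: no packaging, no registration.
adults = people.with_column("is_adult", lambda age: age >= 18, "age")
result = adults.select("name", "is_adult").rows
```

The schema is what lets `with_column` and `select` refer to columns by name rather than by position in the record.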

SLIDE 19

UDF in MySQL

SLIDE 20

UDF in Spark SQL

SLIDE 21

Spark Streaming

SLIDE 22

Spark Streaming

Figure panels: Continuous operator processing model; Discretized stream processing model
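
The discretized stream processing model named on this slide treats a stream as a sequence of small batches, one per time interval, and runs an ordinary batch computation on each. A minimal sketch follows, assuming timestamped events and a one-second batch interval. The function names and the sample events are hypothetical, and empty intervals are simply skipped here.

```python
from collections import Counter, defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into batches, one per time interval."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // batch_interval)].append(value)
    return [batches[k] for k in sorted(batches)]


def running_word_count(batches):
    """Run the same batch job on every interval, carrying state between batches."""
    state = Counter()
    snapshots = []
    for batch in batches:
        state = state + Counter(batch)     # new state = old state + this batch's counts
        snapshots.append(dict(state))
    return snapshots


# Made-up sample events: (seconds since start, word).
events = [(0.2, "spark"), (0.7, "rdd"), (1.1, "spark"), (2.5, "spark"), (2.9, "graphx")]
batches = discretize(events, batch_interval=1.0)
snapshots = running_word_count(batches)
```

Because each interval's result depends only on the previous state and that interval's input, a lost result can be recomputed from those two pieces, in the same spirit as RDD lineage.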

SLIDE 23

GraphX

SLIDE 24

GraphX

SLIDE 25

GraphX

SLIDE 26

GraphX

SLIDE 27

GraphX

SLIDE 28

GraphX

§ On graph computation alone, GraphX does not beat specialized graph-parallel systems
§ But it outperforms them on a full graph analytics pipeline

SLIDE 29

MLlib

SLIDE 30

MLlib

§ More than 50 common algorithms for distributed model training
§ Supports pipeline construction on Spark
§ Integrates well with other Spark libraries
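
Pipeline construction, as mentioned above, means chaining stages so that each stage is fitted on the output of the previous one and the fitted chain can then be applied to new data. The sketch below shows that pattern in plain Python. `MeanCenter`, `Square`, and `Pipeline` are hypothetical classes for illustration and are not MLlib classes.

```python
class MeanCenter:
    """Stage with learned state: fit() learns the mean, transform() subtracts it."""

    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean for x in xs]


class Square:
    """Stateless stage: fit() has nothing to learn."""

    def fit(self, xs):
        return self

    def transform(self, xs):
        return [x * x for x in xs]


class Pipeline:
    """Fits stages in order, feeding each stage the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, xs):
        for stage in self.stages:
            xs = stage.fit(xs).transform(xs)
        return self

    def transform(self, xs):
        for stage in self.stages:
            xs = stage.transform(xs)
        return xs


pipeline = Pipeline([MeanCenter(), Square()]).fit([1.0, 2.0, 3.0])
out = pipeline.transform([2.0, 4.0])   # uses the mean learned during fit (2.0)
```

Keeping `fit` and `transform` separate is what lets the statistics learned on training data be reused unchanged on new data.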

SLIDE 31

Why use Apache Spark?

§ Ecosystem
§ Competitive performance
§ Low cost in sharing data
§ Low latency of MapReduce steps
§ Control over bottleneck resources

SLIDE 32

Apache Spark in 2016

§ Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
§ Between 2010 and 2016, Apache Spark grew to 1,000 contributors and thousands of deployments.

SLIDE 33

Apache Spark Today

SLIDE 34

Apache Spark: A Unified Engine for Big Data Processing

§ What is Apache Spark?
  § Apache Spark = MapReduce + RDDs
§ How can it perform multiple types of conversions over big data?
  § Higher-level libraries enable Apache Spark to handle different types of big data workloads

SLIDE 35

“Try Apache Spark if you are new to the big data processing world”

Huanyi Chen

SLIDE 36

Q&A

§ What issues does persisting data in memory cause? For example, garbage collection?
§ What are the Parallel Random Access Machine model and the Bulk Synchronous Parallel model? Can these two models describe any computation in a distributed setting?
§ Will optimizing one library cause other libraries to lose performance?
§ Is using memory as storage really the next generation of storage?
