Big Data Processing with Apache Spark, Jay Urbain, PhD


Oct 12, 2023


SLIDE 1

Big Data Processing with Apache Spark

Jay Urbain, PhD

Credits: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

http://spark.apache.org/

SLIDE 2

Motivation

SLIDE 3

Example: MapReduce

SLIDE 4

Example: MapReduce

SLIDE 5

Example: MapReduce

SLIDE 6

Example: MapReduce

SLIDE 7

Example: MapReduce
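
The MapReduce model named in these slides can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop or Spark code: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical, and word count is assumed as the example job because it is the customary one.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["spark caches data", "MapReduce writes data to disk"])
```

In a real cluster each phase runs across many machines and intermediate results go through stable storage between jobs, which is the cost the in-memory approach in the following slides aims to avoid.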

SLIDE 8

Idea: cache data in-memory

http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

SLIDE 9

Example: MapReduce

http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

SLIDE 10

Goal: In-Memory Data Sharing

SLIDE 11

Challenge

SLIDE 12

Challenge

http://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
http://piccolo.news.cs.nyu.edu/piccolo.pdf

SLIDE 13

Solution: Resilient Distributed Datasets (RDDs)

SLIDE 14

RDD Recovery
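
As a rough illustration of lineage-based recovery, the sketch below models a dataset that remembers how it was derived, so a lost partition can be recomputed from its parent rather than restored from a replica. It is a toy in plain Python under stated assumptions: `ToyRDD` and all of its methods are hypothetical names, not Spark's API, only per-partition (narrow) transformations are modeled, and the base data is assumed to live on stable storage.

```python
class ToyRDD:
    """Toy stand-in for an RDD: cached partitions plus the lineage to rebuild them."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base partitions (assumed to be on stable storage)
        self.parent = parent  # dataset this one was derived from
        self.fn = fn          # per-partition function applied to the parent
        self.cache = {}       # partition index -> records held "in memory"

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, i):
        """Return partition i, recomputing it from lineage if it is not cached."""
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            part = list(self.source[i])
        else:
            part = self.fn(self.parent.compute(i))
        self.cache[i] = part
        return part

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]

    def lose_partition(self, i):
        """Simulate a node failure that drops one cached partition."""
        self.cache.pop(i, None)


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.collect()
even_squares.lose_partition(1)   # partition 1 is gone from memory
after = even_squares.collect()   # rebuilt from the parent via the stored function
```

Only the lost partition is recomputed; partition 0 is served from the cache, which is the property that makes this style of recovery cheap compared with replicating every write.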

SLIDE 15

Generality of RDDs

SLIDE 16

Tradeoffs

SLIDE 17

Tradeoffs

SLIDE 18

Tradeoffs

SLIDE 19

SLIDE 20

SLIDE 21

SLIDE 22

SLIDE 23

http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

SLIDE 24

Programming API

SLIDE 25

SLIDE 26

SLIDE 27

SLIDE 28

Programming Spark

Written in Scala (pronounced "scah-lah"); runs on the JVM
Can write applications in Scala, Java, Python, and R

Interactive shells: Scala, Python, R

SLIDE 29

SLIDE 30

SLIDE 31

http://mesos.apache.org/

SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

SLIDE 36

Spark References

http://spark.apache.org/docs/latest/programming-guide.html

http://spark.apache.org/docs/latest/api/python/index.html

SLIDE 37

SLIDE 38

SLIDE 39

SLIDE 40

SLIDE 41

SLIDE 42

SLIDE 43

SLIDE 44

SLIDE 45

SLIDE 46

SLIDE 47

SLIDE 48

SLIDE 49

SLIDE 50

SLIDE 51

SLIDE 52

SLIDE 53

SLIDE 54

SLIDE 55

SLIDE 56

SLIDE 57

SLIDE 58

SLIDE 59

SLIDE 60

SLIDE 61

SLIDE 62

SLIDE 63

http://shop.oreilly.com/product/0636920028512.do

SLIDE 64

http://shop.oreilly.com/product/0636920028512.do

SLIDE 65

http://shop.oreilly.com/product/0636920028512.do

SLIDE 66

http://shop.oreilly.com/product/0636920028512.do

SLIDE 67