Big Data Processing with Apache Spark, Jay Urbain, PhD


SLIDE 1

Big Data Processing with Apache Spark

Jay Urbain, PhD

Credits: "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

http://spark.apache.org/

SLIDE 2

Motivation

SLIDE 3

Example: MapReduce

SLIDE 4

Example: MapReduce

SLIDE 5

Example: MapReduce

SLIDE 6

Example: MapReduce

SLIDE 7

Example: MapReduce
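
The slide's own example did not survive as text, so the following is only a minimal sketch of the MapReduce pattern, assuming word count as the task. It runs in a single Python process; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are illustrative assumptions, not part of the deck or of any framework's API.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases; hypothetical helper for illustration only."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Sample input invented for this sketch.
counts = word_count([
    "spark caches data in memory",
    "mapreduce writes data to disk",
])
```

In a real MapReduce job each phase runs across many machines and intermediate results are written to storage between jobs, which is the cost the later slides on in-memory data sharing address.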

SLIDE 8

Idea: cache data in-memory

http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

SLIDE 9

Example: MapReduce

http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

SLIDE 10

Goal: In-Memory Data Sharing

SLIDE 11

Challenge

SLIDE 12

Challenge

http://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf

http://piccolo.news.cs.nyu.edu/piccolo.pdf

SLIDE 13

Solution: Resilient Distributed Datasets (RDDs)

SLIDE 14

RDD Recovery
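
The RDD paper credited on the title slide describes recovery through lineage: a lost partition is recomputed from the transformations that produced it rather than restored from a replica. The sketch below is a toy, single-process model of that idea. ToyRDD and its methods are hypothetical names invented here and are not Spark's classes or API.

```python
class ToyRDD:
    """A toy stand-in for an RDD (assumption: not Spark's real implementation)."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # lineage: how to rebuild partition i
        self._cache = {}         # in-memory copies of computed partitions

    @classmethod
    def from_list(cls, data, num_partitions):
        # Split the source data round-robin across partitions.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn):
        # The child remembers its parent and the function, not the data.
        parent = self
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.partition(i)],
        )

    def partition(self, i):
        # Recompute from lineage only if the cached copy is missing.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate losing the in-memory copy of one partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
squared = base.map(lambda x: x * x)

before = squared.collect()
squared.lose_partition(1)   # the cached copy of partition 1 is gone
after = squared.collect()   # partition 1 is rebuilt from its lineage
```

Only the lost partition is recomputed; the other partitions are served from the cache, which is the property that lets this approach avoid replicating the data itself.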

SLIDE 15

Generality of RDDs

SLIDE 16

Tradeoffs

SLIDE 17

Tradeoffs

SLIDE 18

Tradeoffs

SLIDE 23

http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

SLIDE 24

Programming API

SLIDE 28

Programming Spark

  • Spark is written in Scala (pronounced "scah-lah"), which runs on the JVM
  • Applications can be written in Scala, Java, Python, and R
  • Interactive shells: Scala, Python, R
SLIDE 31

http://mesos.apache.org/

SLIDE 36

Spark References

  • http://spark.apache.org/docs/latest/programming-guide.html

  • http://spark.apache.org/docs/latest/api/python/index.html
SLIDE 63

http://shop.oreilly.com/product/0636920028512.do

SLIDE 64

http://shop.oreilly.com/product/0636920028512.do

SLIDE 65

http://shop.oreilly.com/product/0636920028512.do

SLIDE 66

http://shop.oreilly.com/product/0636920028512.do
