

SLIDE 1

Patrick Wendell

Databricks

Big Data Processing

SLIDE 2

About me

  • Committer and PMC member of Apache Spark
  • “Former” PhD student at Berkeley
  • Left Berkeley to help found Databricks
  • Now managing open source work at Databricks
  • Focus is on networking and operating systems

SLIDE 3

Outline

  • Overview of today
  • The Big Data problem
  • The Spark Computing Framework

SLIDE 4

Straw poll. Have you:

  • Written code in a numerical programming environment (R/Matlab/Weka/etc)?
  • Written code in a general programming language (Python/Java/etc)?
  • Written multi-threaded or distributed computer programs?

SLIDE 5

Today’s workshop

  • Overview of trends in large-scale data analysis
  • Introduction to the Spark cluster-computing engine, with a focus on numerical computing
  • Hands-on workshop with TAs covering Scala basics and using Spark for machine learning

SLIDE 6

Hands-on Exercises

You’ll be given a cluster of 5 machines*

  (a) Text mining of full text of Wikipedia
  (b) Movie recommendations with ALS

*5 machines × 200 people = 1,000 machines

SLIDE 7

Outline

  • Overview of today
  • The Big Data problem
  • The Spark Computing Framework

SLIDE 8

The Big Data Problem

Data is growing faster than computation speeds:

  • Growing data sources
    » Web, mobile, scientific, …
  • Cheap storage
    » Doubling every 18 months
  • Stalling CPU speeds
  • Storage bottlenecks

SLIDE 9

Examples

  • Facebook’s daily logs: 60 TB
  • 1000 genomes project: 200 TB
  • Google web index: 10+ PB
  • Cost of 1 TB of disk: $50
  • Time to read 1 TB from disk: 6 hours (50 MB/s)
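As a sanity check on that last figure, the arithmetic works out like this (a quick sketch, not from the slides):

// Time to read 1 TB sequentially at 50 MB/s
val terabyteInMB = 1000.0 * 1000.0      // 1 TB ≈ 1,000,000 MB
val seconds      = terabyteInMB / 50.0  // 20,000 s
val hours        = seconds / 3600.0     // ≈ 5.6 hours, i.e. roughly 6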

SLIDE 10

The Big Data Problem

Single machine can no longer process or even store all the data!

Only solution is to distribute over large clusters

SLIDE 11

Google Datacenter

How do we program this thing?

SLIDE 12

What’s hard about cluster computing?

How to divide work across nodes?
  • Must consider network, data locality
  • Moving data may be very expensive

How to deal with failures?
  • 1 server fails every 3 years => 10K nodes see 10 faults/day
  • Even worse: stragglers (node is not failed, but slow)
SLIDE 13

A comparison:

How to divide work across nodes?
  » What if certain elements of a matrix/array became 1000 times slower to access than others?

How to deal with failures?
  » What if, with 10% probability, certain variables in your program became undefined?

SLIDE 14

Outline

  • Overview of today
  • The Big Data problem
  • The Spark Computing Framework

SLIDE 15

The Spark Computing Framework

Provides a programming abstraction and parallel runtime to hide this complexity.

“Here’s an operation, run it on all of the data”
  » I don’t care where it runs (you schedule that)
  » In fact, feel free to run it twice on different nodes

SLIDE 16

Resilient Distributed Datasets

  • Key programming abstraction in Spark
  • Think “parallel data frame”
  • Supports collections/set APIs

// get variance of the 5th field of a tab-delimited dataset
val rdd = sc.textFile("big-file.txt")  // slide shorthand: spark.hadoopFile(...)
rdd.map(line => line.split("\t")(4))
   .map(cell => cell.toDouble)
   .variance()

SLIDE 17

RDD Execution

  • Automatically split work into many small, idempotent tasks
  • Send tasks to nodes based on data locality
  • Load-balance dynamically as tasks finish
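A minimal sketch of how this surfaces in the API, assuming a SparkContext named sc (the HDFS path is a made-up example):

val rdd = sc.textFile("hdfs:///data/big-file.txt")  // one partition per HDFS block
rdd.partitions.size                                 // = number of tasks in a map stage
rdd.map(_.length).reduce(_ + _)                     // tasks run near their data and are
                                                    // load-balanced as they finish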

SLIDE 18

Fault Recovery

1. If a task crashes:
  » Retry on another node
    • OK for a map because it had no dependencies
    • OK for a reduce because map outputs are on disk

Requires user code to be deterministic

SLIDE 19

Fault Recovery

2. If a node crashes:
  » Relaunch its current tasks on other nodes
  » Relaunch any maps the node previously ran
    • Necessary because their output files were lost
SLIDE 20

Fault Recovery

3. If a task is going slowly (straggler):
  » Launch second copy of task on another node
  » Take the output of whichever copy finishes first, and kill the other one
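In Spark this straggler handling is called speculative execution. A minimal sketch of enabling it through the spark.speculation settings (the app name is a made-up example):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("speculation-example")           // made-up name
  .set("spark.speculation", "true")            // launch backup copies of slow tasks
  .set("spark.speculation.multiplier", "1.5")  // "slow" = 1.5x the median task time
val sc = new SparkContext(conf)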

SLIDE 21

Spark Compared with Earlier Approaches

  • Higher-level, declarative API
  • Built from the “ground up” for performance, including support for leveraging distributed cluster memory
  • Optimized for iterative computations (e.g. machine learning)
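A minimal sketch of the in-memory reuse that makes iteration fast, assuming a SparkContext named sc (the path is a made-up example):

// Parse once, cache in cluster memory, then iterate over the cached data
val nums = sc.textFile("hdfs:///data/numbers.txt")
  .map(_.toDouble)
  .cache()                                 // kept in distributed memory after first use
for (i <- 1 to 10) {
  val sum = nums.map(_ * i).reduce(_ + _)  // each pass reads from memory, not disk
}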

SLIDE 22

Spark Platform

[Diagram: the Spark RDD API as the base layer of the platform]

SLIDE 23

Spark Platform: GraphX

[Diagram: GraphX (graph processing, alpha) implemented as RDD-based graphs on top of the Spark RDD API]

val graph = Graph(vertexRDD, edgeRDD)
graph.connectedComponents()  // returns a new graph
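For context, a self-contained sketch of building those two RDDs with the GraphX API, assuming a SparkContext named sc (the vertex and edge data are made up):

import org.apache.spark.graphx.{Edge, Graph}

val vertexRDD = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edgeRDD   = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
val graph     = Graph(vertexRDD, edgeRDD)
graph.connectedComponents().vertices.collect()  // each vertex paired with the smallest
                                                // vertex id in its connected component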

SLIDE 24

Spark Platform: MLLib

[Diagram: MLLib (machine learning, RDD-based matrices) added alongside GraphX on top of the Spark RDD API]

val model = LogisticRegressionWithSGD.train(trainRDD)
dataRDD.map(point => model.predict(point))
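Filled out with the real MLlib types, the slide’s pattern could look roughly like this (the tiny dataset is made up, and this train signature also takes an iteration count):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val trainRDD = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 1.1)),  // made-up data
  LabeledPoint(0.0, Vectors.dense(0.5, 0.3))
))
val model   = LogisticRegressionWithSGD.train(trainRDD, 20)  // 20 SGD iterations
val dataRDD = trainRDD.map(_.features)
dataRDD.map(point => model.predict(point))                   // RDD of predicted labels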

SLIDE 25

Spark Platform: Streaming

[Diagram: Spark Streaming (RDD-based streams) added alongside GraphX and MLLib on top of the Spark RDD API]

dstream = spark.networkInputStream()
dstream.countByWindow(Seconds(30))
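The slide’s code is shorthand; with the actual DStream API the same idea looks roughly like this (host, port, and checkpoint directory are made-up examples):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))          // 1-second batches
ssc.checkpoint("/tmp/checkpoint")                       // required for windowed counts
val dstream = ssc.socketTextStream("localhost", 9999)   // made-up host/port
dstream.countByWindow(Seconds(30), Seconds(1)).print()  // records seen in the last 30s
ssc.start()
ssc.awaitTermination()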

SLIDE 26

Spark Platform: SQL

[Diagram: Spark SQL (schema RDDs) added alongside GraphX, MLLib, and Streaming on top of the Spark RDD API]

rdd = sql"select * from rdd1 where age > 10"
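A sketch of the same query with the early Spark SQL API, assuming a SparkContext named sc (the Person class and its data are made up):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)           // made-up schema

val sqlContext = new SQLContext(sc)
import sqlContext._                                 // implicit RDD -> SchemaRDD

val people = sc.parallelize(Seq(Person("alice", 35), Person("bob", 8)))
people.registerAsTable("rdd1")
val rdd = sql("SELECT * FROM rdd1 WHERE age > 10")  // returns a SchemaRDD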

SLIDE 27

Performance

[Charts: SQL[1] response time in seconds for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem); Streaming[2] throughput in MB/s/node for Storm vs. Spark; Graph[3] response time in minutes for Hadoop, Giraph, GraphX]

[1] https://amplab.cs.berkeley.edu/benchmark/
[2] Discretized Streams: Fault-Tolerant Streaming Computation at Scale. SOSP 2013.
[3] https://amplab.cs.berkeley.edu/publication/graphx-grades/

SLIDE 28

Composition

Most large-scale data programs involve parsing, filtering, and cleaning data. Spark allows users to compose these patterns elegantly, e.g. select an input dataset with SQL, then run machine learning on the result, as sketched below.
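Continuing the made-up examples above (table and column names are hypothetical):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// 1. Select the input dataset with SQL (hypothetical table and columns)
val rows = sql("SELECT age, income, label FROM users")

// 2. Run machine learning on the result
val points = rows.map(r =>
  LabeledPoint(r.getDouble(2), Vectors.dense(r.getDouble(0), r.getDouble(1))))
val model = LogisticRegressionWithSGD.train(points, 20)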

SLIDE 29

Spark Community

One of the largest open source projects in big data
  • 150+ developers contributing
  • 30+ companies contributing

[Chart: contributors in the past year]

SLIDE 30

Community Growth

  • Oct ‘12 – Spark 0.6: 17 contributors
  • Feb ‘13 – Spark 0.7: 31 contributors
  • Sept ‘13 – Spark 0.8: 67 contributors
  • Feb ‘14 – Spark 0.9: 83 contributors

SLIDE 31

Databricks

Primary developers of Spark today
Founded by Spark project creators
A nexus of several research areas:
  » OS and computer architecture
  » Networking and distributed systems
  » Statistics and machine learning

SLIDE 32

Today’s Speakers

  • Holden Karau – Worked on large-scale storage infrastructure @ Google. Intro to Spark and Scala
  • Hossein Falaki – Data scientist at Apple. Numerical Programming with Spark
  • Xiangrui Meng – PhD from Stanford ICME. Lead contributor on MLLib. Deep dive on Spark’s MLLib

SLIDE 33

Questions?