Data-Intensive Distributed Computing CS 431/631 (Fall 2020) Part 3: From MapReduce to Spark (1/2)



SLIDE 1

Data-Intensive Distributed Computing

Part 3: From MapReduce to Spark (1/2)

CS 431/631 (Fall 2020), Ali Abedi

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

SLIDE 2

Source: Google

The datacenter is the computer!

What’s the instruction set?

SLIDE 3

Abstraction:

CPU ↔ Cluster of computers
Instruction set ↔ Map/Reduce, Combine/Partition


We need a solution for both storage and computing.

SLIDE 4

Source: Wikipedia (ENIAC)

So you like programming in assembly?


So is programming in MapReduce like programming in assembly? How can we do better?
SLIDE 5

What’s the solution?

Design a higher-level language. Write a compiler.

SLIDE 6

Hadoop is great, but it’s really waaaaay too low level!

What we really need is SQL!
What we really need is a scripting language!

Yahoo and Facebook designed their own solutions on top of Hadoop to make it more flexible for their engineers.

SLIDE 7

Hive (SQL) and Pig (Pig scripts). Both are open-source projects today!

SLIDE 8

Hive  Pig
MapReduce
HDFS

Pig and Hive programs are ultimately compiled into MapReduce jobs.

SLIDE 9

Source: Wikipedia (Pig)

Pig!

SLIDE 10

Visits:
User  Url         Time
Amy   cnn.com     8:00
Amy   bbc.com     10:00
Amy   flickr.com  10:05
Fred  cnn.com     12:00

URL Info:
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9

Task: Find the top 10 most visited pages in each category

Pig Slides adapted from Olston et al. (SIGMOD 2008)

Pig: Example

SLIDE 11

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);

store topUrls into '/data/topUrls';

Pig Slides adapted from Olston et al. (SIGMOD 2008)

Pig: Example Script
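To make the dataflow concrete, here is a minimal sketch of the same pipeline in plain Python (not Pig or Hadoop), run over the sample rows from the earlier slide; the variable names mirror the script's aliases and are illustrative only.

```python
from collections import Counter, defaultdict

# Sample rows from the Visits and URL Info tables on the earlier slide.
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]

# group visits by url; foreach group generate url, count(visits)
visit_counts = Counter(url for _, url, _ in visits)

# join visitCounts by url, urlInfo by url
joined = [
    (url, visit_counts[url], category)
    for url, category, _ in url_info
    if url in visit_counts
]

# group by category; foreach category generate top(visitCounts, 10)
by_category = defaultdict(list)
for url, count, category in joined:
    by_category[category].append((url, count))
top_urls = {
    cat: sorted(pairs, key=lambda p: p[1], reverse=True)[:10]
    for cat, pairs in by_category.items()
}
```

On this toy data, `top_urls` maps News to cnn.com (2 visits) then bbc.com (1), and Photos to flickr.com (1); espn.com drops out because it has no visits.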

SLIDE 12

load visits
group by url
foreach url generate count
load urlInfo
join on url
group by category
foreach category generate top(urls, 10)

Pig Slides adapted from Olston et al. (SIGMOD 2008)

Pig Query Plan

SLIDE 13

load visits
group by url
foreach url generate count
load urlInfo
join on url
group by category
foreach category generate top(urls, 10)

The plan above compiles into three MapReduce jobs: Map1/Reduce1, Map2/Reduce2, Map3/Reduce3.

Pig Slides adapted from Olston et al. (SIGMOD 2008)

Pig: MapReduce Execution

SLIDE 14

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);

store topUrls into '/data/topUrls';

SLIDE 15

But isn’t Pig slower?

Sure, but C can be slower than assembly too…

SLIDE 16

Source: Google

The datacenter is the computer! What’s the instruction set? Okay, let’s fix this!

Having to formulate every problem in terms of map and reduce alone is restrictive.
SLIDE 17

HDFS → map → reduce → HDFS → map → reduce → HDFS → map → reduce → HDFS → map → reduce → HDFS

What’s wrong?

MapReduce Workflows

There is a lot of disk I/O involved, which significantly slows down MapReduce workflows like this.
SLIDE 18

Want MM?

HDFS → map → HDFS → map → HDFS → map → HDFS → map → HDFS

✔ ✗

It’s okay to omit the reduce stage, but the output of a map cannot feed directly into another map.
SLIDE 19

Want MRR?

HDFS → map → reduce → reduce → HDFS

✔ ✗

Similarly, we cannot directly feed the output of one reduce into another reduce.
SLIDE 20

Source: Google

The datacenter is the computer! Let’s enrich the instruction set!

Can we add more operations to make the instruction set more flexible?
SLIDE 21

Spark

Answer to “What’s beyond MapReduce?” Brief history:

Developed at UC Berkeley AMPLab in 2009
Open-sourced in 2010
Became a top-level Apache project in February 2014

SLIDE 22

Spark vs. Hadoop (Google Trends, September 2014)

Spark is more popular than Hadoop today.

SLIDE 23

MapReduce

map     f: (K1, V1) ⇒ List[(K2, V2)]
reduce  g: (K2, Iterable[V2]) ⇒ List[(K3, V3)]

Overall: List[(K1, V1)] ⇒ List[(K3, V3)]

This was the only mechanism we had in MapReduce.
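As a sketch of this mechanism, the classic word-count example can be simulated in plain Python with exactly these two signatures, plus a toy single-process driver that sorts and groups intermediate pairs (standing in for the shuffle); the driver and sample documents are illustrative, not part of Hadoop.

```python
from itertools import groupby
from operator import itemgetter

# map: (K1, V1) => List[(K2, V2)] -- here (docid, text) => [(word, 1), ...]
def mapper(docid, text):
    return [(word, 1) for word in text.split()]

# reduce: (K2, Iterable[V2]) => List[(K3, V3)] -- here (word, counts) => [(word, total)]
def reducer(word, counts):
    return [(word, sum(counts))]

# Toy driver: run all mappers, shuffle (sort + group by key), run reducers.
def run_mapreduce(inputs):
    intermediate = sorted(kv for k, v in inputs for kv in mapper(k, v))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, (v for _, v in group)))
    return output

result = run_mapreduce([("d1", "a b a"), ("d2", "b c")])
# result is [("a", 2), ("b", 2), ("c", 1)]
```

Note that every computation must be squeezed into these two function shapes; that restriction is exactly what the following slides relax.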

SLIDE 24

Map-like Operations

filter         f: (T) ⇒ Boolean                RDD[T] ⇒ RDD[T]
map            f: (T) ⇒ U                      RDD[T] ⇒ RDD[U]
flatMap        f: (T) ⇒ TraversableOnce[U]     RDD[T] ⇒ RDD[U]
mapPartitions  f: (Iterator[T]) ⇒ Iterator[U]  RDD[T] ⇒ RDD[U]

But Spark provides many more operations (an enriched instruction set).
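A rough way to see what these signatures mean: model an RDD as a list of partitions (plain Python lists, not Spark's API) and apply each operation per partition. The sample data and helper names here are illustrative only.

```python
# A tiny stand-in for an RDD: a list of partitions.
rdd = [[1, 2, 3], [4, 5, 6]]  # "RDD[Int]" with two partitions

# filter  f: T => Boolean -- keep even elements
filtered = [[x for x in part if x % 2 == 0] for part in rdd]

# map  f: T => U -- one output element per input element
mapped = [[x * 10 for x in part] for part in rdd]

# flatMap  f: T => TraversableOnce[U] -- each element expands to a sequence
flat_mapped = [[y for x in part for y in (x, x)] for part in rdd]

# mapPartitions  f: Iterator[T] => Iterator[U] -- sees a whole partition at once
def part_sum(it):
    yield sum(it)  # e.g. one partial aggregate per partition
partition_sums = [list(part_sum(iter(part))) for part in rdd]
```

The per-partition structure is the point of `mapPartitions`: it lets you amortize setup cost (or compute partial aggregates) across a whole partition instead of per element.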

SLIDE 25

Reduce-like Operations

groupByKey      RDD[(K, V)] ⇒ RDD[(K, Iterable[V])]
reduceByKey     f: (V, V) ⇒ V                          RDD[(K, V)] ⇒ RDD[(K, V)]
aggregateByKey  seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U  RDD[(K, V)] ⇒ RDD[(K, U)]
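These by-key operations can likewise be sketched over plain Python pairs (the helper functions are illustrative stand-ins, not Spark's API); note how aggregateByKey lets the accumulator type U differ from the value type V.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]  # "RDD[(K, V)]"

# groupByKey: (K, V) pairs => (K, Iterable[V])
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# reduceByKey  f: (V, V) => V -- fold values of each key with f
def reduce_by_key(kvs, f):
    out = {}
    for k, v in kvs:
        out[k] = f(out[k], v) if k in out else v
    return out

sums = reduce_by_key(pairs, lambda a, b: a + b)

# aggregateByKey  seqOp: (U, V) => U, combOp: (U, U) => U
# Here U = (sum, count), which would let us compute per-key averages.
def aggregate_by_key(kvs, zero, seq_op, comb_op):
    out = {}
    for k, v in kvs:
        out[k] = seq_op(out.get(k, zero), v)
    # comb_op would merge per-partition results; this sketch has one partition.
    return out

sum_counts = aggregate_by_key(
    pairs, (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
```

The design distinction matters in practice: `reduceByKey` and `aggregateByKey` can combine values before the shuffle, while `groupByKey` must move every value across the network.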

SLIDE 26

And many other operations!
