SLIDE 1

Data Intensive Computing Frameworks

Amir H. Payberah

amir@sics.se

Amirkabir University of Technology 1394/2/25

Amir H. Payberah (AUT) Data Intensive Computing 1394/2/25 1 / 95

SLIDE 2

Big Data

small data big data

SLIDE 3

◮ Big Data refers to datasets and data flows so large that they have outpaced our capability to store, process, analyze, and understand them.

SLIDE 4

The Four Dimensions of Big Data

◮ Volume: data size
◮ Velocity: data generation rate
◮ Variety: data heterogeneity
◮ The 4th V is for Vacillation: Veracity/Variability/Value

SLIDE 5

Where Does Big Data Come From?

SLIDE 6

Big Data Market Driving Factors

The number of web pages indexed by Google, which was around one million in 1998, exceeded one trillion in 2008, and its expansion is accelerated by the appearance of social networks.∗

∗“Mining big data: current status, and forecast to the future” [Wei Fan et al., 2013]

SLIDE 7

Big Data Market Driving Factors

The amount of mobile data traffic is expected to grow to 10.8 exabytes per month by 2016.∗

∗“Worldwide Big Data Technology and Services 2012-2015 Forecast” [Dan Vesset et al., 2013]

SLIDE 8

Big Data Market Driving Factors

More than 65 billion devices were connected to the Internet by 2010, and this number will go up to 230 billion by 2020.∗

∗“The Internet of Things Is Coming” [John Mahoney et al., 2013]

SLIDE 9

Big Data Market Driving Factors

Many companies are moving towards using Cloud services to access Big Data analytical tools.

SLIDE 10

Big Data Market Driving Factors

Open source communities

SLIDE 11

How To Store and Process Big Data?

SLIDE 12

But First, The History

SLIDE 13

4000 B.C.

◮ Manual recording
◮ From tablets to papyrus, to parchment, and then to paper

SLIDE 14

1450

◮ Gutenberg’s printing press

SLIDE 15

1800’s - 1940’s

◮ Punched cards (no fault-tolerance)
◮ Binary data
◮ 1890: US census
◮ 1911: IBM appeared

SLIDE 16

1940’s - 1950’s

◮ Magnetic tapes

SLIDE 17

1950’s - 1960’s

◮ Large-scale mainframe computers
◮ Batch transaction processing
◮ File-oriented record processing model (e.g., COBOL)

SLIDE 18

1960’s - 1970’s

◮ Hierarchical DBMS (one-to-many)
◮ Network DBMS (many-to-many)
◮ VM OS by IBM → multiple VMs on a single physical node.

SLIDE 19

1970’s - 1980’s

◮ Relational DBMS (tables) and SQL
◮ ACID
◮ Client-server computing
◮ Parallel processing

SLIDE 20

1990’s - 2000’s

◮ Virtual Private Network (VPN) connections
◮ The Internet...

SLIDE 21

2000’s - Now

◮ Cloud computing
◮ NoSQL: BASE instead of ACID
◮ Big Data

SLIDE 22

How To Store and Process Big Data?

SLIDE 23

Scale Up vs. Scale Out (1/2)

◮ Scale up or scale vertically: adding resources to a single node in a system.
◮ Scale out or scale horizontally: adding more nodes to a system.

SLIDE 24

Scale Up vs. Scale Out (2/2)

◮ Scale up: more expensive than scaling out.
◮ Scale out: more challenging for fault tolerance and software development.

SLIDE 25

Taxonomy of Parallel Architectures

DeWitt, D. and Gray, J. “Parallel database systems: the future of high performance database systems”. Communications of the ACM, 35(6), 85-98, 1992.

SLIDE 27

Two Main Types of Tools

◮ Data store
◮ Data processing

SLIDE 28

Data Store

SLIDE 29

Data Store

◮ How to store and access files? File System

SLIDE 30

What Is a Filesystem?

◮ Controls how data is stored in and retrieved from disk.

SLIDE 35

Distributed Filesystems

◮ When data outgrows the storage capacity of a single machine.
◮ Partition data across a number of separate machines.
◮ Distributed filesystems: manage the storage across a network of machines.

SLIDE 36

HDFS (1/2)

◮ Hadoop Distributed FileSystem
◮ Appears as a single disk
◮ Runs on top of a native filesystem, e.g., ext3

SLIDE 37

HDFS (2/2)

SLIDE 38

Files and Blocks (1/2)

◮ Files are split into blocks.
◮ Blocks are the single unit of storage.
  • Transparent to the user.
  • 64MB or 128MB.

SLIDE 39

Files and Blocks (2/2)

◮ Each block is replicated on multiple machines; the default replication factor is 3.
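To make the block and replication numbers concrete, here is a minimal arithmetic sketch. The class BlockMath and its methods are hypothetical helpers written for this example, not part of the HDFS API; the 1000MB file size is an assumed input.

```java
// Hypothetical helper for illustration only; not part of the HDFS API.
final class BlockMath {
    // Number of blocks needed for a file: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster when every block has `replication` copies.
    static long rawBytesStored(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileBytes = 1000 * mb; // assumed example file size
        System.out.println(blockCount(fileBytes, 128 * mb));   // prints 8
        System.out.println(blockCount(fileBytes, 64 * mb));    // prints 16
        System.out.println(rawBytesStored(fileBytes, 3) / mb); // prints 3000
    }
}
```

With the default replication factor of 3, the assumed 1000MB file occupies roughly 3000MB of raw cluster storage.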

SLIDE 40

HDFS Write

◮ 1. Create a new file in the Namenode’s namespace; calculate block topology.
◮ 2, 3, 4. Stream data to the first, second, and third node.
◮ 5, 6, 7. Success/failure acknowledgment.

SLIDE 41

HDFS Read

◮ 1. Retrieve block locations.
◮ 2, 3. Read blocks to re-assemble the file.
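The read steps can be sketched with a toy in-memory model. This is only an illustration under strong assumptions: ToyHdfsRead, its maps, and the block ids b1 and b2 are hypothetical, and the real HDFS client API looks nothing like this.

```java
import java.util.List;
import java.util.Map;

// Toy model of the read path; all names here are hypothetical.
final class ToyHdfsRead {
    // namenode: file name -> ordered list of block ids
    // datanodeBlocks: block id -> block contents (any one replica is enough to read)
    static String read(String file,
                       Map<String, List<String>> namenode,
                       Map<String, String> datanodeBlocks) {
        StringBuilder out = new StringBuilder();
        for (String blockId : namenode.get(file)) {   // step 1: retrieve block locations
            out.append(datanodeBlocks.get(blockId));  // steps 2, 3: read blocks in order
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> namenode = Map.of("doc.txt", List.of("b1", "b2"));
        Map<String, String> blocks = Map.of("b1", "Hello ", "b2", "World");
        System.out.println(read("doc.txt", namenode, blocks)); // prints Hello World
    }
}
```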

SLIDE 42

What About Databases?

SLIDE 44

Database and Database Management System

◮ Database: an organized collection of data.
◮ Database Management System (DBMS): software that interacts with users, other applications, and the database itself to capture and analyze data.

SLIDE 46

Relational Database Management Systems (RDBMSs)

◮ RDBMSs: the dominant technology for storing structured data in web and business applications.

◮ SQL is good

  • Rich language and toolset
  • Easy to use and integrate
  • Many vendors

◮ They promise: ACID

SLIDE 47

ACID Properties

◮ Atomicity
◮ Consistency
◮ Isolation
◮ Durability

SLIDE 48

RDBMS Challenges

◮ Web-based applications caused spikes.

  • Internet-scale data size
  • High read-write rates
  • Frequent schema changes

SLIDE 49

Scaling RDBMSs is Expensive and Inefficient

[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]

SLIDE 51

NoSQL

◮ Avoidance of unneeded complexity
◮ High throughput
◮ Horizontal scalability and running on commodity hardware

SLIDE 52

NoSQL Cost and Performance

[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]

SLIDE 53

RDBMS vs. NoSQL

[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]

SLIDE 54

NoSQL Data Models

[http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques]

SLIDE 55

NoSQL Data Models: Key-Value

◮ Collection of key/value pairs.
◮ Ordered Key-Value: processing over key ranges.
◮ Dynamo, Scalaris, Voldemort, Riak, ...
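Processing over key ranges can be illustrated with a sorted in-memory map. This is a single-machine sketch, not how any of the listed stores is implemented; OrderedKvSketch and the sample keys are made up for the example.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Single-machine sketch of an ordered key-value store; names are hypothetical.
final class OrderedKvSketch {
    // Returns all entries with fromInclusive <= key < toExclusive, in key order.
    static SortedMap<String, String> rangeScan(TreeMap<String, String> store,
                                               String fromInclusive,
                                               String toExclusive) {
        return store.subMap(fromInclusive, toExclusive);
    }

    public static void main(String[] args) {
        TreeMap<String, String> store = new TreeMap<>();
        store.put("user:001", "Bob");
        store.put("user:002", "Jonathan");
        store.put("user:105", "Michael");
        store.put("video:7", "clip");
        // Keys are kept sorted, so a range query touches only the matching run.
        System.out.println(rangeScan(store, "user:001", "user:100"));
        // prints {user:001=Bob, user:002=Jonathan}
    }
}
```

A plain (unordered) key-value store supports only get and put by exact key; keeping keys sorted is what makes the range scan cheap.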

SLIDE 56

NoSQL Data Models: Column-Oriented

◮ Similar to a key/value store, but the value can have multiple attributes (columns).
◮ Column: a set of data values of a particular type.
◮ BigTable, HBase, Cassandra, ...

SLIDE 57

NoSQL Data Models: Document-Based

◮ Similar to a column-oriented store, but values can be complex documents, e.g., XML, YAML, JSON, and BSON.
◮ CouchDB, MongoDB, ...

{ FirstName: "Bob", Address: "5 Oak St.", Hobby: "sailing" }

{ FirstName: "Jonathan", Address: "15 Wanamassa Point Road",
  Children: [ {Name: "Michael", Age: 10}, {Name: "Jennifer", Age: 8} ] }

SLIDE 58

NoSQL Data Models: Graph-Based

◮ Uses graph structures with nodes, edges, and properties to represent and store data.
◮ Neo4j, InfoGrid, ...

SLIDE 59

Data Processing

SLIDE 60

Challenges

◮ How to distribute computation?
◮ How can we make it easy to write distributed programs?
◮ How to handle machine failures?

SLIDE 62

Idea

◮ Issue:

  • Copying data over a network takes time.

◮ Idea:

  • Bring computation close to the data.
  • Store files multiple times for reliability.

SLIDE 63

MapReduce

◮ A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters.

SLIDE 64

Simplicity

◮ Don’t worry about parallelization, fault tolerance, data distribution, and load balancing (MapReduce takes care of these).

◮ Hide system-level details from programmers.

Simplicity!

SLIDE 65

Warm-up Task (1/2)

◮ We have a huge text document.
◮ Count the number of times each distinct word appears in the file.

SLIDE 68

Warm-up Task (2/2)

◮ The file is too large for memory, but all (word, count) pairs fit in memory.
◮ words(doc.txt) | sort | uniq -c
  • where words takes a file and outputs the words in it, one per line
◮ This captures the essence of MapReduce; the great thing is that it is naturally parallelizable.
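A single-machine version of this task can be sketched as one pass over the lines with an in-memory table of counts, relying on the stated assumption that all (word, count) pairs fit in memory. WarmUpWordCount is a hypothetical name, the two sample lines are made up, and a real run would stream the lines from the file rather than from a list.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// One-pass word count; only the counts are kept in memory, never the whole file.
final class WarmUpWordCount {
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted, like `sort | uniq -c`
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample input; in practice the lines would be read lazily from disk.
        List<String> lines = List.of("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println(count(lines));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```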

SLIDE 74

MapReduce Overview

◮ words(doc.txt) | sort | uniq -c
◮ Sequentially read a lot of data.
◮ Map: extract something you care about.
◮ Group by key: sort and shuffle.
◮ Reduce: aggregate, summarize, filter or transform.
◮ Write the result.

SLIDE 75

Example: Word Count

◮ Consider doing a word count of the following file using MapReduce:

Hello World Bye World Hello Hadoop Goodbye Hadoop

SLIDE 76

Example: Word Count - map

◮ The map function reads in words one at a time and outputs (word, 1) for each parsed input word.

◮ The map function output is:

(Hello, 1) (World, 1) (Bye, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)

SLIDE 77

Example: Word Count - shuffle

◮ The shuffle phase between the map and reduce phases creates a list of values associated with each key.

◮ The reduce function input is:

(Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1)) (Hello, (1, 1)) (World, (1, 1))

SLIDE 78

Example: Word Count - reduce

◮ The reduce function sums the numbers in the list for each key and outputs (word, count) pairs.
◮ The output of the reduce function is the output of the MapReduce job:

(Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2)
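The three phases of the word count example can be traced in plain Java, without Hadoop. This is a sequential, in-memory sketch meant only to show the data passed between phases; MiniMapReduce and its method names are hypothetical, and the input is assumed to be split into two lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sequential simulation of map, shuffle, and reduce; not the Hadoop API.
final class MiniMapReduce {
    // map: emit (word, 1) for every word in one input line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // shuffle: group all values by key, with keys in sorted order
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reduce: sum the list of values for one key
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Hello World Bye World", "Hello Hadoop Goodbye Hadoop")));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

In a real cluster, each map call and each reduce call could run on a different machine, which is why the two functions never share state.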

SLIDE 79

Example: Word Count - map

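import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The type parameters are elided on the slide as <...>.
public static class MyMap extends Mapper<...> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every whitespace-separated token in the input line.
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}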

SLIDE 80

Example: Word Count - reduce

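import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The type parameters are elided on the slide as <...>.
public static class MyReduce extends Reducer<...> {
    // The Context-based Reducer passes the values as an Iterable; a method taking
    // an Iterator would not override reduce(), so the summing would never run.
    public void reduce(Text key, Iterable<...> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values)
            sum += value.get();
        context.write(key, new IntWritable(sum));
    }
}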

SLIDE 81

Example: Word Count - driver

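import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Job.getInstance replaces the deprecated Job constructor.
    Job job = Job.getInstance(conf, "wordcount");
    // Tells Hadoop which jar holds the job classes when running on a cluster.
    job.setJarByClass(MyMap.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(MyMap.class);
    job.setReducerClass(MyReduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}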

SLIDE 82

MapReduce Execution

J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters”, Communications of the ACM 51(1), 2008.

SLIDE 84

MapReduce Weaknesses

◮ The MapReduce programming model was not designed for complex operations, e.g., data mining.
◮ Very expensive, i.e., it always goes to disk and HDFS.

SLIDE 85

Solution?

SLIDE 86

Spark

◮ Extends MapReduce with more operators.
◮ Support for advanced data flow graphs.
◮ In-memory and out-of-core processing.

SLIDE 87

Spark vs. Hadoop

SLIDE 91

Resilient Distributed Datasets (RDD)

◮ Immutable collections of objects spread across a cluster.
◮ An RDD is divided into a number of partitions.
◮ Partitions of an RDD can be stored on different nodes of a cluster.
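The two ideas above, immutability and partitioning, can be mimicked on one machine. The ToyRdd class below is a hypothetical teaching sketch, not Spark's RDD API: each inner list stands in for a partition that could live on a different node, and the round-robin split is an assumption made for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy stand-in for an RDD: immutable, split into partitions. Not the Spark API.
final class ToyRdd<T> {
    private final List<List<T>> partitions; // each inner list could live on a different node

    private ToyRdd(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    // Splits the data round-robin into numPartitions immutable partitions.
    static <T> ToyRdd<T> parallelize(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < data.size(); i++) {
            parts.get(i % numPartitions).add(data.get(i));
        }
        List<List<T>> frozen = new ArrayList<>();
        for (List<T> part : parts) {
            frozen.add(List.copyOf(part));
        }
        return new ToyRdd<>(List.copyOf(frozen));
    }

    // A transformation never modifies this dataset; it builds a new one,
    // partition by partition, so each partition could be processed independently.
    <R> ToyRdd<R> map(Function<T, R> f) {
        List<List<R>> out = new ArrayList<>();
        for (List<T> part : partitions) {
            List<R> mapped = new ArrayList<>();
            for (T item : part) {
                mapped.add(f.apply(item));
            }
            out.add(List.copyOf(mapped));
        }
        return new ToyRdd<>(List.copyOf(out));
    }

    List<List<T>> partitions() {
        return partitions;
    }

    public static void main(String[] args) {
        ToyRdd<Integer> numbers = ToyRdd.parallelize(List.of(1, 2, 3, 4, 5), 2);
        System.out.println(numbers.partitions());                  // prints [[1, 3, 5], [2, 4]]
        System.out.println(numbers.map(x -> x * 10).partitions()); // prints [[10, 30, 50], [20, 40]]
    }
}
```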

SLIDE 92

What About Streaming Data?

SLIDE 95

Motivation

◮ Many applications must process large streams of live data and provide results in real-time.
◮ Processing information as it flows, without storing it persistently.
◮ Traditional DBMSs:

  • Store and index data before processing it.
  • Process data only when explicitly asked by the users.

SLIDE 96

DBMS vs. DSMS (1/3)

◮ DBMS: persistent data where updates are relatively infrequent.
◮ DSMS: transient data that is continuously updated.

SLIDE 97

DBMS vs. DSMS (2/3)

◮ DBMS: runs queries just once to return a complete answer.
◮ DSMS: executes standing queries, which run continuously and provide updated answers as new data arrives.
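The difference can be sketched with a standing count query that updates its answer on every arriving item. StandingCountQuery is a hypothetical class written for this example; it keeps only the running aggregate, not the items themselves.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a standing query: "count per key", re-answered on every arrival.
final class StandingCountQuery {
    private final Map<String, Integer> counts = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();

    // Called once per arriving item; the item is not stored, only the aggregate.
    void onItem(String key) {
        int updated = counts.merge(key, 1, Integer::sum);
        emitted.add(key + "=" + updated); // updated answer pushed to the consumer
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        StandingCountQuery query = new StandingCountQuery();
        for (String item : List.of("a", "b", "a")) {
            query.onItem(item);
        }
        System.out.println(query.emitted()); // prints [a=1, b=1, a=2]
    }
}
```

A DBMS-style query would instead scan stored data once when asked and return a single complete answer.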

SLIDE 98

DBMS vs. DSMS (3/3)

◮ Despite these differences, DSMSs resemble DBMSs: both process incoming data through a sequence of transformations based on SQL operators, e.g., selections, aggregates, joins.

SLIDE 99

DSMS

◮ Source: produces the incoming information flows
◮ Sink: consumes the results of processing
◮ IFP engine: processes incoming flows
◮ Processing rules: how to process the incoming flows
◮ Rule manager: adds/removes processing rules

SLIDE 101

What About Graph Data?

SLIDE 103

Large Graph

SLIDE 104

Large-Scale Graph Processing

◮ Large graphs need large-scale processing.
◮ A large graph either cannot fit into the memory of a single computer, or it fits only at huge cost.

SLIDE 105

Data-Parallel Model for Large-Scale Graph Processing

◮ The platforms that have worked well for developing parallel applications are not necessarily effective for large-scale graph problems.

◮ Why?

SLIDE 106

Graph Algorithms Characteristics

◮ Unstructured problems: difficult to partition the data
◮ Data-driven computations: difficult to partition computation
◮ Poor data locality
◮ High data access to computation ratio

SLIDE 107

Proposed Solution

Graph-Parallel Processing

◮ Computation typically depends on the neighbors.

SLIDE 108

Graph-Parallel Processing

◮ Restricts the types of computation.
◮ New techniques to partition and distribute graphs.
◮ Exploit graph structure.
◮ Executes graph algorithms orders-of-magnitude faster than more general data-parallel systems.

SLIDE 109

Data-Parallel vs. Graph-Parallel Computation

SLIDE 110

Vertex-Centric Programming

◮ Think like a vertex.
◮ Each vertex computes its value individually, in parallel.
◮ Each vertex can see its local context and updates its value accordingly.
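As a small illustration of thinking like a vertex, the sketch below propagates the maximum value through a graph: in each round, every vertex looks only at its own value and its neighbors' values. The class VertexCentricMax, the sequential loop standing in for parallel supersteps, and the three-vertex example graph are all assumptions of this example, not the API of any particular graph framework.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sequential simulation of a vertex-centric computation (maximum value propagation).
final class VertexCentricMax {
    // neighbors: vertex id -> adjacent vertex ids (undirected edges listed both ways)
    static Map<Integer, Integer> run(Map<Integer, List<Integer>> neighbors,
                                     Map<Integer, Integer> initial) {
        Map<Integer, Integer> values = new HashMap<>(initial);
        boolean changed = true;
        while (changed) { // one loop iteration stands for one round over all vertices
            changed = false;
            Map<Integer, Integer> next = new HashMap<>(values);
            for (Map.Entry<Integer, List<Integer>> entry : neighbors.entrySet()) {
                int vertex = entry.getKey();
                int best = values.get(vertex);
                for (int neighbor : entry.getValue()) {
                    best = Math.max(best, values.get(neighbor)); // local context only
                }
                if (best != values.get(vertex)) {
                    next.put(vertex, best);
                    changed = true;
                }
            }
            values = next; // all vertices see the new values in the next round
        }
        return values;
    }

    public static void main(String[] args) {
        // Path graph 1 - 2 - 3 with initial values 3, 6, 2.
        Map<Integer, List<Integer>> graph =
                Map.of(1, List.of(2), 2, List.of(1, 3), 3, List.of(2));
        Map<Integer, Integer> result = run(graph, Map.of(1, 3, 2, 6, 3, 2));
        System.out.println(result.get(1) + " " + result.get(2) + " " + result.get(3));
        // prints 6 6 6
    }
}
```

Because each vertex reads only the previous round's values, the inner loop over vertices could be run in parallel without changing the result.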

SLIDE 115

Summary

◮ Scale-out vs. Scale-up
◮ How to store data?

  • Distributed file systems: HDFS
  • NoSQL databases: HBase, Cassandra, ...

◮ How to process data?

  • Batch data: MapReduce, Spark
  • Streaming data: Spark Streaming, Flink, Storm, S4
  • Graph data: Giraph, GraphLab, GraphX, Flink

SLIDE 116

Questions?
