Data Intensive Computing Frameworks
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology 1394/2/25
Amir H. Payberah (AUT) Data Intensive Computing 1394/2/25 1 / 95
[Figure: small data vs. big data]
◮ Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand them.
◮ Volume: data size
◮ Velocity: data generation rate
◮ Variety: data heterogeneity
◮ The 4th V vacillates among Veracity, Variability, and Value.
The number of web pages indexed by Google has grown enormously, and this expansion has been accelerated by the appearance of social networks.∗
∗ “Mining big data: current status, and forecast to the future” [Wei Fan et al., 2013]
The amount of mobile data traffic is expected to grow to 10.8 exabytes per month by 2016.∗
∗ “Worldwide Big Data Technology and Services 2012-2015 Forecast” [Dan Vesset et al., 2013]
More than 65 billion devices were connected to the Internet by 2010, and this number is expected to reach 230 billion by 2020.∗
∗ “The Internet of Things Is Coming” [John Mahoney et al., 2013]
Many companies are moving towards using Cloud services to access Big Data analytical tools.
Open source communities
◮ Manual recording
◮ From tablets to papyrus, to parchment, and then to paper
◮ Gutenberg’s printing press
◮ Punched cards (no fault-tolerance)
◮ Binary data
◮ 1890: US census
◮ 1911: IBM founded
◮ Magnetic tapes
◮ Large-scale mainframe computers
◮ Batch transaction processing
◮ File-oriented record processing model (e.g., COBOL)
◮ Hierarchical DBMS (one-to-many)
◮ Network DBMS (many-to-many)
◮ VM OS by IBM → multiple VMs on a single physical node
◮ Relational DBMS (tables) and SQL
◮ ACID
◮ Client-server computing
◮ Parallel processing
◮ Virtual Private Network (VPN) connections
◮ The Internet...
◮ Cloud computing
◮ NoSQL: BASE instead of ACID
◮ Big Data
◮ Scale up, or scale vertically: adding resources to a single node in a system.
◮ Scale out, or scale horizontally: adding more nodes to a system.
◮ Scale up: more expensive than scaling out.
◮ Scale out: more challenging for fault tolerance and software development.
DeWitt, D. and Gray, J. “Parallel database systems: the future of high performance database systems”. Communications of the ACM, 35(6), 85-98, 1992.
◮ Data store ◮ Data processing
◮ How to store and access files? File System
◮ Controls how data is stored in and retrieved from disk.
◮ When data outgrows the storage capacity of a single machine.
◮ Partition data across a number of separate machines.
◮ Distributed filesystems: manage the storage across a network of machines.
◮ Hadoop Distributed FileSystem (HDFS)
◮ Appears as a single disk
◮ Runs on top of a native filesystem, e.g., ext3
◮ Files are split into blocks.
◮ Blocks are the single unit of storage.
◮ The same block is replicated on multiple machines (default replication factor: 3).
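The splitting and replication idea can be sketched in a few lines of plain Java. This is an illustration only, not HDFS's actual placement policy (which is rack-aware and more sophisticated); the class and method names are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration: split a file into fixed-size blocks and assign each block
// to k distinct nodes. HDFS's real policy is rack-aware; this is only the idea.
public class BlockPlacement {

    // Split a file of fileSize units into blocks of at most blockSize units.
    public static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long remaining = fileSize; remaining > 0; remaining -= blockSize) {
            blocks.add(Math.min(remaining, blockSize));
        }
        return blocks;
    }

    // Assign each block to `replicas` distinct nodes, round-robin.
    public static List<List<Integer>> placeReplicas(int numBlocks, int numNodes, int replicas) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                // distinct nodes as long as replicas <= numNodes
                nodes.add((b + r) % numNodes);
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 350 MB file with 128 MB blocks -> blocks of 128, 128, and 94 MB.
        System.out.println(splitIntoBlocks(350, 128));
        // 3 blocks spread over 4 nodes with replication factor 3.
        System.out.println(placeReplicas(3, 4, 3));
    }
}
```

Losing one node still leaves two copies of every block it held, which is why replication is the basic fault-tolerance mechanism here.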
◮ 1. Create a new file in the Namenode’s namespace; calculate the block topology.
◮ 2, 3, 4. Stream the data to the first, second, and third node.
◮ 5, 6, 7. Success/failure acknowledgment.
◮ 1. Retrieve block locations.
◮ 2, 3. Read blocks to re-assemble the file.
◮ Database: an organized collection of data.
◮ Database Management System (DBMS): software that interacts with users, other applications, and the database itself to capture and analyze data.
◮ RDBMSs: the dominant technology for storing structured data in web and business applications.
◮ SQL is good.
◮ They promise: ACID.
◮ Atomicity
◮ Consistency
◮ Isolation
◮ Durability
◮ Web-based applications caused spikes in load.
[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
◮ Avoidance of unneeded complexity
◮ High throughput
◮ Horizontal scalability and running on commodity hardware
[http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques]
◮ Collection of key/value pairs.
◮ Ordered Key-Value: processing over key ranges.
◮ Dynamo, Scalaris, Voldemort, Riak, ...
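The "processing over key ranges" point can be illustrated with a toy store backed by a sorted map. This is not the API of Dynamo, Riak, or any real store; `OrderedKVStore` is an invented name, and `TreeMap` stands in for the store's sorted on-disk structure.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy illustration of an ordered key-value store: keys are kept sorted,
// which makes range scans (e.g., all keys with a given prefix) cheap.
public class OrderedKVStore {
    private final TreeMap<String, String> data = new TreeMap<>();

    public void put(String key, String value) { data.put(key, value); }

    public String get(String key) { return data.get(key); }

    // All entries with keys in [from, to): a range query over sorted keys.
    public SortedMap<String, String> range(String from, String to) {
        return data.subMap(from, to);
    }

    public static void main(String[] args) {
        OrderedKVStore store = new OrderedKVStore();
        store.put("user:alice", "Stockholm");
        store.put("user:bob", "Tehran");
        store.put("item:42", "book");
        // Scan only the "user:" key range (';' sorts just after ':').
        System.out.println(store.range("user:", "user;"));
    }
}
```

An unordered hash-based store would have to scan every key to answer the same query; keeping keys sorted is what buys cheap range processing.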
◮ Similar to a key/value store, but the value can have multiple attributes (Columns).
◮ Column: a set of data values of a particular type.
◮ BigTable, HBase, Cassandra, ...
◮ Similar to a column-oriented store, but values can be complex documents, e.g., XML, YAML, JSON, and BSON.
◮ CouchDB, MongoDB, ...

{ FirstName: "Bob",
  Address: "5 Oak St.",
  Hobby: "sailing" }

{ FirstName: "Jonathan",
  Address: "15 Wanamassa Point Road",
  Children: [ {Name: "Michael", Age: 10},
              {Name: "Jennifer", Age: 8} ] }
◮ Uses graph structures with nodes, edges, and properties to represent and store data.
◮ Neo4J, InfoGrid, ...
◮ How to distribute computation?
◮ How can we make it easy to write distributed programs?
◮ Machine failures.
◮ A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters.
◮ Don’t worry about parallelization, fault tolerance, data distribution, and load balancing (MapReduce takes care of these).
◮ Hide system-level details from programmers.
Simplicity!
◮ We have a huge text document.
◮ Count the number of times each distinct word appears in the file.
◮ File too large for memory, but all (word, count) pairs fit in memory.
◮ words(doc.txt) | sort | uniq -c
◮ This captures the essence of MapReduce; the great thing is that it is naturally parallelizable.
◮ words(doc.txt) | sort | uniq -c
◮ Sequentially read a lot of data.
◮ Map: extract something you care about.
◮ Group by key: sort and shuffle.
◮ Reduce: aggregate, summarize, filter or transform.
◮ Write the result.
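The steps above can be simulated on a single machine in plain Java: map to (word, 1) pairs, group by key, reduce by summing. This sketch is for illustration only (`WordCountPipeline` is an invented name, not Hadoop's API); a real MapReduce framework runs the same three steps distributed over a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine sketch of the MapReduce pipeline for word count:
// map -> group by key -> reduce.
public class WordCountPipeline {

    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map: extract (word, 1) pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));

        // Group by key: collect the list of values for each word,
        // in sorted key order (like the shuffle phase).
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce: sum the list of values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            counts.put(g.getKey(), g.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("Hello World Bye World",
                                             "Hello Hadoop Goodbye Hadoop")));
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Because the map step emits independent pairs and the reduce step only sees one key's values at a time, each step parallelizes naturally, which is exactly the point of the model.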
◮ Consider doing a word count of the following file using MapReduce:
Hello World Bye World Hello Hadoop Goodbye Hadoop
◮ The map function reads in words one at a time and outputs (word, 1) for each parsed input word.
◮ The map function output is:
(Hello, 1) (World, 1) (Bye, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)
◮ The shuffle phase between the map and reduce phases creates a list of values associated with each key.
◮ The reduce function input is:
(Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1)) (Hello, (1, 1)) (World, (1, 1))
◮ The reduce function sums the numbers in the list for each key.
◮ The output of the reduce function is the output of the MapReduce job:
(Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2)
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values)
      sum += value.get();
    context.write(key, new IntWritable(sum));
  }
}
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  job.setMapperClass(MyMap.class);
  job.setReducerClass(MyReduce.class);

  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  job.waitForCompletion(true);
}
◮ The MapReduce programming model has not been designed for complex operations.
◮ Very expensive: intermediate results always go to disk and HDFS.
Spark
◮ Extends MapReduce with more operators.
◮ Support for advanced data flow graphs.
◮ In-memory and out-of-core processing.
◮ Immutable collections of objects spread across a cluster.
◮ An RDD is divided into a number of partitions.
◮ Partitions of an RDD can be stored on different nodes of a cluster.
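These three properties can be sketched in plain Java. This is not Spark's API; `MiniRDD` is an invented class that only illustrates the idea that a transformation such as map builds a new partitioned collection instead of mutating the old one, and that each partition could be processed on a different node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy sketch of the RDD idea (not Spark's API): an immutable collection
// split into partitions; map() yields a NEW collection of partitions.
public class MiniRDD<T> {
    private final List<List<T>> partitions;  // never mutated after construction

    public MiniRDD(List<T> data, int numPartitions) {
        partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) partitions.add(new ArrayList<>());
        for (int i = 0; i < data.size(); i++)
            partitions.get(i % numPartitions).add(data.get(i));
    }

    private MiniRDD(List<List<T>> partitions) { this.partitions = partitions; }

    public int numPartitions() { return partitions.size(); }

    // map: apply f element-wise, partition by partition, yielding a new RDD.
    // In a cluster, each partition would be transformed on its own node.
    public <R> MiniRDD<R> map(Function<T, R> f) {
        List<List<R>> result = new ArrayList<>();
        for (List<T> part : partitions) {
            List<R> newPart = new ArrayList<>();
            for (T x : part) newPart.add(f.apply(x));
            result.add(newPart);
        }
        return new MiniRDD<>(result);
    }

    // collect: gather all partitions back into one list (an "action").
    public List<T> collect() {
        List<T> all = new ArrayList<>();
        for (List<T> part : partitions) all.addAll(part);
        return all;
    }

    public static void main(String[] args) {
        MiniRDD<Integer> rdd = new MiniRDD<>(List.of(1, 2, 3, 4, 5, 6), 3);
        System.out.println(rdd.map(x -> x * 2).collect());
    }
}
```

Immutability is what makes the model easy to reason about: a lost partition can simply be recomputed from its parent, which is the basis of Spark's fault tolerance.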
◮ Many applications must process large streams of live data and provide results in real-time.
◮ Processing information as it flows, without storing it persistently.
◮ Traditional DBMSs are not designed for this.
◮ DBMS: persistent data where updates are relatively infrequent.
◮ DSMS: transient data that is continuously updated.
◮ DBMS: runs queries just once to return a complete answer.
◮ DSMS: executes standing queries, which run continuously and provide updated answers as new data arrives.
◮ Despite these differences, DSMSs resemble DBMSs: both process incoming data through a sequence of transformations based on SQL operators.
◮ Source: produces the incoming information flows
◮ Sink: consumes the results of processing
◮ IFP engine: processes incoming flows
◮ Processing rules: how to process the incoming flows
◮ Rule manager: adds/removes processing rules
◮ Large graphs need large-scale processing.
◮ A large graph either cannot fit into the memory of a single computer, or fits only at huge cost.
◮ The platforms that have worked well for developing parallel applications are not necessarily effective for large-scale graph problems.
◮ Why?
◮ Unstructured problems: difficult to partition the data
◮ Data-driven computations: difficult to partition computation
◮ Poor data locality
◮ High data access to computation ratio
Graph-Parallel Processing
◮ Computation typically depends on the neighbors.
◮ Restricts the types of computation.
◮ New techniques to partition and distribute graphs.
◮ Exploit graph structure.
◮ Executes graph algorithms orders-of-magnitude faster than more general data-parallel systems.
◮ Think like a vertex.
◮ Each vertex computes its value individually, in parallel.
◮ Each vertex can see its local context, and updates its value accordingly.
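The vertex-centric model can be sketched with a classic toy problem: every vertex adopts the largest value among itself and its neighbors, superstep after superstep, until nothing changes. This is an illustration in the spirit of Pregel-style systems, not any system's API; `VertexMax` and its method names are invented.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of vertex-centric ("think like a vertex") computation: in each
// superstep every vertex looks only at its neighbors' current values and
// updates its own; iteration stops when no vertex changes.
public class VertexMax {

    // adj.get(v) lists the neighbors of vertex v; values[v] is its initial value.
    public static int[] propagateMax(List<List<Integer>> adj, int[] values) {
        int[] current = values.clone();
        boolean changed = true;
        while (changed) {                            // one superstep per iteration
            changed = false;
            int[] next = current.clone();
            for (int v = 0; v < adj.size(); v++)     // conceptually: all vertices in parallel
                for (int u : adj.get(v))
                    if (current[u] > next[v]) {
                        next[v] = current[u];        // adopt a larger neighbor value
                        changed = true;
                    }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // A path graph 0 - 1 - 2 - 3 with initial values 3, 6, 2, 1.
        List<List<Integer>> adj = List.of(List.of(1), List.of(0, 2),
                                          List.of(1, 3), List.of(2));
        System.out.println(Arrays.toString(propagateMax(adj, new int[] {3, 6, 2, 1})));
        // every vertex converges to the global maximum, 6
    }
}
```

Each vertex uses only its local context, yet a global property (the maximum) emerges after a few supersteps; this locality is what lets graph-parallel systems partition the work across machines.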
◮ Scale-out vs. Scale-up
◮ How to store data?
◮ How to process data?