Data Intensive Computing Frameworks
Amir H. Payberah (amir@sics.se)
Amirkabir University of Technology, 1394/2/25


1. NoSQL
◮ Avoidance of unneeded complexity
◮ High throughput
◮ Horizontal scalability, running on commodity hardware

2. NoSQL: Cost and Performance
[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]

3. RDBMS vs. NoSQL
[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]

4. NoSQL Data Models
[http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques]

5. NoSQL Data Models: Key-Value
◮ A collection of key/value pairs.
◮ Ordered key-value stores support processing over key ranges.
◮ Examples: Dynamo, Scalaris, Voldemort, Riak, ...
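
To make the model concrete, here is a minimal sketch of the put/get/delete interface that key-value stores expose. The interface and its single-node, in-memory implementation are illustrative only, not the API of any of the systems above, which partition and replicate the key space across many machines.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The store understands only opaque keys and values; any richer
// structure lives entirely in the client.
interface KeyValueStore<K, V> {
    void put(K key, V value);
    V get(K key);
    void delete(K key);
}

// Single-node stand-in backed by a hash map.
class InMemoryStore<K, V> implements KeyValueStore<K, V> {
    private final Map<K, V> data = new ConcurrentHashMap<>();
    public void put(K key, V value) { data.put(key, value); }
    public V get(K key)             { return data.get(key); }
    public void delete(K key)       { data.remove(key); }
}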

6. NoSQL Data Models: Column-Oriented
◮ Similar to a key/value store, but the value can have multiple attributes (columns).
◮ Column: a set of data values of a particular type.
◮ Examples: BigTable, HBase, Cassandra, ...
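
A rough way to picture the column-oriented layout (a sketch of the data model, not of how BigTable-style systems are implemented) is a map from row key to a per-row map of columns, where different rows may carry different columns:

import java.util.HashMap;
import java.util.Map;

public class ColumnSketch {
    public static void main(String[] args) {
        // Row key -> (column name -> value).
        Map<String, Map<String, String>> table = new HashMap<>();

        Map<String, String> row1 = new HashMap<>();
        row1.put("name", "Bob");
        row1.put("hobby", "sailing");  // column present only in this row
        table.put("user:1", row1);

        Map<String, String> row2 = new HashMap<>();
        row2.put("name", "Jonathan");  // no "hobby" column here
        table.put("user:2", row2);

        System.out.println(table.get("user:1").get("hobby")); // sailing
    }
}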

7. NoSQL Data Models: Document-Based
◮ Similar to a column-oriented store, but the values are complex documents, e.g., XML, YAML, JSON, or BSON.
◮ Examples: CouchDB, MongoDB, ...

{ FirstName: "Bob",
  Address: "5 Oak St.",
  Hobby: "sailing" }

{ FirstName: "Jonathan",
  Address: "15 Wanamassa Point Road",
  Children: [ { Name: "Michael", Age: 10 },
              { Name: "Jennifer", Age: 8 } ] }

8. NoSQL Data Models: Graph-Based
◮ Uses graph structures with nodes, edges, and properties to represent and store data.
◮ Examples: Neo4j, InfoGrid, ...

9. Data Processing

10. Challenges
◮ How to distribute computation?
◮ How can we make it easy to write distributed programs?
◮ How to handle machine failures?

11. Idea
◮ Issue: copying data over a network takes time.
◮ Idea:
• Bring computation close to the data.
• Store files multiple times for reliability.

12. MapReduce
◮ A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters.

13. Simplicity
◮ Don't worry about parallelization, fault tolerance, data distribution, and load balancing (MapReduce takes care of these).
◮ Hide system-level details from programmers. Simplicity!

14. Warm-up Task (1/2)
◮ We have a huge text document.
◮ Count the number of times each distinct word appears in the file.

15. Warm-up Task (2/2)
◮ The file is too large for memory, but all ⟨word, count⟩ pairs fit in memory.
◮ words(doc.txt) | sort | uniq -c
• where words takes a file and outputs the words in it, one per line.
◮ This pipeline captures the essence of MapReduce, and the great thing is that it is naturally parallelizable.
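
For comparison, here is a minimal single-machine version of the task in Java (the class name is illustrative). It relies on exactly the assumption above: only the ⟨word, count⟩ map must fit in memory, not the file itself.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class WordCountLocal {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Stream the file line by line, so the whole file is never in memory.
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null)
                for (String w : line.split("\\s+"))
                    if (!w.isEmpty())
                        counts.merge(w, 1, Integer::sum);
        }
        counts.forEach((w, c) -> System.out.println(w + " " + c));
    }
}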

16. MapReduce Overview
◮ words(doc.txt) | sort | uniq -c
◮ Sequentially read a lot of data.
◮ Map: extract something you care about.
◮ Group by key: sort and shuffle.
◮ Reduce: aggregate, summarize, filter, or transform.
◮ Write the result.

17. Example: Word Count
◮ Consider doing a word count of the following file using MapReduce:

Hello World Bye World
Hello Hadoop Goodbye Hadoop

18. Example: Word Count - map
◮ The map function reads in words one at a time and outputs (word, 1) for each parsed input word.
◮ The map function output is:

(Hello, 1) (World, 1) (Bye, 1) (World, 1)
(Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)

19. Example: Word Count - shuffle
◮ The shuffle phase between the map and reduce phases creates a list of values associated with each key.
◮ The reduce function input is:

(Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1)) (Hello, (1, 1)) (World, (1, 1))

20. Example: Word Count - reduce
◮ The reduce function sums the numbers in the list for each key and outputs (word, count) pairs.
◮ The output of the reduce function is the output of the MapReduce job:

(Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2)

21. Example: Word Count - map

public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the input line and emit (word, 1) for each token.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

22. Example: Word Count - reduce

public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts emitted for this word.
        int sum = 0;
        for (IntWritable val : values)
            sum += val.get();
        context.write(key, new IntWritable(sum));
    }
}

23. Example: Word Count - driver

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(MyMap.class);
    job.setReducerClass(MyReduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}
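
Packaged into a jar together with MyMap and MyReduce (the jar and main class names below are illustrative), the job is submitted with the standard Hadoop launcher; args[0] and args[1] become the input and output directories:

hadoop jar wordcount.jar WordCount <input-dir> <output-dir>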

24. MapReduce Execution
J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM 51(1), 2008.

25. MapReduce Weaknesses
◮ The MapReduce programming model was not designed for complex operations, e.g., data mining.
◮ It is also very expensive: every job reads its input from and writes its output to disk (HDFS), which makes multi-step computations costly.

26. Solution?

27. Spark
◮ Extends MapReduce with more operators.
◮ Support for advanced data flow graphs.
◮ In-memory and out-of-core processing.
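
As a taste of the richer operator set, here is a word count against Spark's Java API (Spark 2.x signatures; the app name and the input/output paths are placeholders). Note how flatMap, mapToPair, and reduceByKey replace the hand-written Mapper and Reducer classes:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // map
            .mapToPair(word -> new Tuple2<>(word, 1))                   // emit (word, 1)
            .reduceByKey((a, b) -> a + b);                              // reduce

        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}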

28. Spark vs. Hadoop
[performance comparison figures omitted in this transcript]

29. Resilient Distributed Datasets (RDD)
◮ Immutable collections of objects spread across a cluster.
◮ An RDD is divided into a number of partitions.
◮ Partitions of an RDD can be stored on different nodes of the cluster.
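
A small sketch of creating an RDD and inspecting its partitions (the data, the partition count, and the local master are arbitrary choices for illustration):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An immutable collection split into 4 partitions; on a real
        // cluster the partitions would live on different nodes.
        JavaRDD<Integer> numbers =
            sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
        System.out.println("partitions: " + numbers.getNumPartitions());

        // Transformations return new RDDs; the original is never modified.
        JavaRDD<Integer> doubled = numbers.map(x -> x * 2);
        System.out.println(doubled.collect());

        sc.stop();
    }
}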

30. What About Streaming Data?

31. Motivation
◮ Many applications must process large streams of live data and provide results in real-time.
◮ They process information as it flows, without storing it persistently.
◮ Traditional DBMSs, in contrast:
• Store and index data before processing it.
• Process data only when explicitly asked by the users.

32. DBMS vs. DSMS (1/3)
◮ DBMS: persistent data, where updates are relatively infrequent.
◮ DSMS: transient data that is continuously updated.

33. DBMS vs. DSMS (2/3)
◮ DBMS: runs queries just once to return a complete answer.
◮ DSMS: executes standing queries, which run continuously and provide updated answers as new data arrives.
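
To make "standing query" concrete, here is a sketch using Spark Streaming's Java API (one possible engine, not covered in these slides; the host, port, and filter predicate are placeholders). The query is installed once and then re-evaluated automatically on every batch of newly arrived data:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StandingQuery {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("standing-query").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // The standing query: keep lines containing "ERROR"; it is
        // re-run on every one-second batch of new input.
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));
        errors.print(); // an updated answer per batch

        ssc.start();
        ssc.awaitTermination();
    }
}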

34. DBMS vs. DSMS (3/3)
◮ Despite these differences, DSMSs resemble DBMSs: both process incoming data through a sequence of transformations based on SQL operators, e.g., selections, aggregates, joins.

35. DSMS
◮ Source: produces the incoming information flows.
◮ Sink: consumes the results of processing.
◮ IFP engine: processes incoming flows.
◮ Processing rules: specify how to process the incoming flows.
◮ Rule manager: adds/removes processing rules.
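
These roles can be sketched as plain interfaces; every type and method name below is illustrative rather than taken from any real engine:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the roles named above.
interface Source<T>  { T next(); }                 // produces the incoming flow
interface Sink<R>    { void consume(R result); }   // consumes processing results
interface Rule<T, R> { R apply(T event); }         // one processing rule

class IfpEngine<T, R> {
    private final List<Rule<T, R>> rules = new ArrayList<>();

    // Rule manager: add/remove rules while the engine runs.
    void addRule(Rule<T, R> rule)    { rules.add(rule); }
    void removeRule(Rule<T, R> rule) { rules.remove(rule); }

    // Engine loop: pull from the source, apply each rule, push to the sink.
    void run(Source<T> source, Sink<R> sink) {
        T event;
        while ((event = source.next()) != null)
            for (Rule<T, R> rule : rules)
                sink.consume(rule.apply(event));
    }
}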
