Data Intensive Computing Frameworks
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology 1394/2/25
Amir H. Payberah (AUT) Data Intensive Computing 1394/2/25 1 / 95
[Figure: small data vs. big data]
◮ Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand them.
◮ Volume: data size
◮ Velocity: data generation rate
◮ Variety: data heterogeneity
◮ The 4th V vacillates among Veracity, Variability, and Value.
The number of web pages indexed by Google has grown enormously, and this expansion has been accelerated by the appearance of social networks.∗
∗ “Mining big data: current status, and forecast to the future” [Wei Fan et al., 2013]
The amount of mobile data traffic is expected to grow to 10.8 exabytes per month by 2016.∗
∗ “Worldwide Big Data Technology and Services 2012-2015 Forecast” [Dan Vesset et al., 2013]
More than 65 billion devices were connected to the Internet by 2010, and this number is expected to reach 230 billion by 2020.∗
∗ “The Internet of Things Is Coming” [John Mahoney et al., 2013]
Many companies are moving towards using Cloud services to access Big Data analytical tools.
Open source communities
◮ Manual recording
◮ From tablets to papyrus, to parchment, and then to paper
◮ Gutenberg’s printing press
◮ Punched cards (no fault-tolerance)
◮ Binary data
◮ 1890: US census
◮ 1911: IBM founded
◮ Magnetic tapes
◮ Large-scale mainframe computers
◮ Batch transaction processing
◮ File-oriented record processing model (e.g., COBOL)
◮ Hierarchical DBMS (one-to-many)
◮ Network DBMS (many-to-many)
◮ VM OS by IBM → multiple VMs on a single physical node
◮ Relational DBMS (tables) and SQL
◮ ACID
◮ Client-server computing
◮ Parallel processing
◮ Virtual Private Network (VPN) connections
◮ The Internet...
◮ Cloud computing
◮ NoSQL: BASE instead of ACID
◮ Big Data
◮ Scale up, or scale vertically: adding resources to a single node in a system.
◮ Scale out, or scale horizontally: adding more nodes to a system.
◮ Scale up: more expensive than scaling out.
◮ Scale out: more challenging for fault tolerance and software development.
DeWitt, D. and Gray, J. “Parallel database systems: the future of high performance database systems”. Communications of the ACM, 35(6), 85-98, 1992.
◮ Data store ◮ Data processing
◮ How to store and access files? File System
◮ Controls how data is stored in and retrieved from disk.
◮ When data outgrows the storage capacity of a single machine.
◮ Partition data across a number of separate machines.
◮ Distributed filesystems: manage the storage across a network of machines.
◮ Hadoop Distributed FileSystem (HDFS)
◮ Appears as a single disk
◮ Runs on top of a native filesystem, e.g., ext3
◮ Files are split into blocks.
◮ Blocks are the single unit of storage.
◮ The same block is replicated on multiple machines (default replication factor: 3).
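The splitting and replication idea can be sketched in a few lines of plain Java. This is an illustration only, not HDFS's actual placement policy (which is rack-aware and more sophisticated); the class and method names are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration: split a file into fixed-size blocks and assign each block
// to k distinct nodes. HDFS's real policy is rack-aware; this is only the idea.
public class BlockPlacement {

    // Split a file of fileSize units into blocks of at most blockSize units.
    public static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long remaining = fileSize; remaining > 0; remaining -= blockSize) {
            blocks.add(Math.min(remaining, blockSize));
        }
        return blocks;
    }

    // Assign each block to `replicas` distinct nodes, round-robin.
    public static List<List<Integer>> placeReplicas(int numBlocks, int numNodes, int replicas) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                // distinct nodes as long as replicas <= numNodes
                nodes.add((b + r) % numNodes);
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 350 MB file with 128 MB blocks -> blocks of 128, 128, and 94 MB.
        System.out.println(splitIntoBlocks(350, 128));
        // 3 blocks spread over 4 nodes with replication factor 3.
        System.out.println(placeReplicas(3, 4, 3));
    }
}
```

Losing one node still leaves two copies of every block it held, which is why replication is the basic fault-tolerance mechanism here.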
◮ 1. Create a new file in the Namenode’s namespace; calculate the block topology.
◮ 2, 3, 4. Stream the data to the first, second, and third node.
◮ 5, 6, 7. Success/failure acknowledgment.
◮ 1. Retrieve block locations.
◮ 2, 3. Read blocks to re-assemble the file.
◮ Database: an organized collection of data.
◮ Database Management System (DBMS): software that interacts with users, other applications, and the database itself to capture and analyze data.
◮ RDBMSs: the dominant technology for storing structured data in web and business applications.
◮ SQL is good.
◮ They promise: ACID.
◮ Atomicity
◮ Consistency
◮ Isolation
◮ Durability
◮ Web-based applications caused spikes in load.
[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
◮ Avoidance of unneeded complexity
◮ High throughput
◮ Horizontal scalability and running on commodity hardware
[http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques]
◮ Collection of key/value pairs.
◮ Ordered Key-Value: processing over key ranges.
◮ Dynamo, Scalaris, Voldemort, Riak, ...
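The "processing over key ranges" point can be illustrated with a toy store backed by a sorted map. This is not the API of Dynamo, Riak, or any real store; `OrderedKVStore` is an invented name, and `TreeMap` stands in for the store's sorted on-disk structure.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy illustration of an ordered key-value store: keys are kept sorted,
// which makes range scans (e.g., all keys with a given prefix) cheap.
public class OrderedKVStore {
    private final TreeMap<String, String> data = new TreeMap<>();

    public void put(String key, String value) { data.put(key, value); }

    public String get(String key) { return data.get(key); }

    // All entries with keys in [from, to): a range query over sorted keys.
    public SortedMap<String, String> range(String from, String to) {
        return data.subMap(from, to);
    }

    public static void main(String[] args) {
        OrderedKVStore store = new OrderedKVStore();
        store.put("user:alice", "Stockholm");
        store.put("user:bob", "Tehran");
        store.put("item:42", "book");
        // Scan only the "user:" key range (';' sorts just after ':').
        System.out.println(store.range("user:", "user;"));
    }
}
```

An unordered hash-based store would have to scan every key to answer the same query; keeping keys sorted is what buys cheap range processing.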
◮ Similar to a key/value store, but the value can have multiple attributes (Columns).
◮ Column: a set of data values of a particular type.
◮ BigTable, HBase, Cassandra, ...
◮ Similar to a column-oriented store, but values can be complex documents, e.g., XML, YAML, JSON, and BSON.
◮ CouchDB, MongoDB, ...

{ FirstName: "Bob",
  Address: "5 Oak St.",
  Hobby: "sailing" }

{ FirstName: "Jonathan",
  Address: "15 Wanamassa Point Road",
  Children: [ {Name: "Michael", Age: 10},
              {Name: "Jennifer", Age: 8} ] }
◮ Uses graph structures with nodes, edges, and properties to represent and store data.
◮ Neo4J, InfoGrid, ...
◮ How to distribute computation?
◮ How can we make it easy to write distributed programs?
◮ Machine failures.
◮ A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters.
◮ Don’t worry about parallelization, fault tolerance, data distribution, and load balancing (MapReduce takes care of these).
◮ Hide system-level details from programmers.
Simplicity!
◮ We have a huge text document.
◮ Count the number of times each distinct word appears in the file.
◮ File too large for memory, but all (word, count) pairs fit in memory.
◮ words(doc.txt) | sort | uniq -c
◮ This captures the essence of MapReduce; the great thing is that it is naturally parallelizable.
◮ words(doc.txt) | sort | uniq -c
◮ Sequentially read a lot of data.
◮ Map: extract something you care about.
◮ Group by key: sort and shuffle.
◮ Reduce: aggregate, summarize, filter or transform.
◮ Write the result.
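The steps above can be simulated on a single machine in plain Java: map to (word, 1) pairs, group by key, reduce by summing. This sketch is for illustration only (`WordCountPipeline` is an invented name, not Hadoop's API); a real MapReduce framework runs the same three steps distributed over a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine sketch of the MapReduce pipeline for word count:
// map -> group by key -> reduce.
public class WordCountPipeline {

    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map: extract (word, 1) pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));

        // Group by key: collect the list of values for each word,
        // in sorted key order (like the shuffle phase).
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce: sum the list of values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            counts.put(g.getKey(), g.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("Hello World Bye World",
                                             "Hello Hadoop Goodbye Hadoop")));
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Because the map step emits independent pairs and the reduce step only sees one key's values at a time, each step parallelizes naturally, which is exactly the point of the model.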
◮ Consider doing a word count of the following file using MapReduce:
Hello World Bye World Hello Hadoop Goodbye Hadoop
◮ The map function reads in words one at a time and outputs (word, 1) for each parsed input word.
◮ The map function output is:
(Hello, 1) (World, 1) (Bye, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)
◮ The shuffle phase between the map and reduce phases creates a list of values associated with each key.
◮ The reduce function input is:
(Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1)) (Hello, (1, 1)) (World, (1, 1))
◮ The reduce function sums the numbers in the list for each key.
◮ The output of the reduce function is the output of the MapReduce job:
(Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2)
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values)
      sum += value.get();
    context.write(key, new IntWritable(sum));
  }
}
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  job.setMapperClass(MyMap.class);
  job.setReducerClass(MyReduce.class);

  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  job.waitForCompletion(true);
}
◮ The MapReduce programming model has not been designed for complex operations.
◮ Very expensive: intermediate results always go to disk and HDFS.
Spark
◮ Extends MapReduce with more operators.
◮ Support for advanced data flow graphs.
◮ In-memory and out-of-core processing.
◮ Immutable collections of objects spread across a cluster.
◮ An RDD is divided into a number of partitions.
◮ Partitions of an RDD can be stored on different nodes of a cluster.
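These three properties can be sketched in plain Java. This is not Spark's API; `MiniRDD` is an invented class that only illustrates the idea that a transformation such as map builds a new partitioned collection instead of mutating the old one, and that each partition could be processed on a different node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy sketch of the RDD idea (not Spark's API): an immutable collection
// split into partitions; map() yields a NEW collection of partitions.
public class MiniRDD<T> {
    private final List<List<T>> partitions;  // never mutated after construction

    public MiniRDD(List<T> data, int numPartitions) {
        partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) partitions.add(new ArrayList<>());
        for (int i = 0; i < data.size(); i++)
            partitions.get(i % numPartitions).add(data.get(i));
    }

    private MiniRDD(List<List<T>> partitions) { this.partitions = partitions; }

    public int numPartitions() { return partitions.size(); }

    // map: apply f element-wise, partition by partition, yielding a new RDD.
    // In a cluster, each partition would be transformed on its own node.
    public <R> MiniRDD<R> map(Function<T, R> f) {
        List<List<R>> result = new ArrayList<>();
        for (List<T> part : partitions) {
            List<R> newPart = new ArrayList<>();
            for (T x : part) newPart.add(f.apply(x));
            result.add(newPart);
        }
        return new MiniRDD<>(result);
    }

    // collect: gather all partitions back into one list (an "action").
    public List<T> collect() {
        List<T> all = new ArrayList<>();
        for (List<T> part : partitions) all.addAll(part);
        return all;
    }

    public static void main(String[] args) {
        MiniRDD<Integer> rdd = new MiniRDD<>(List.of(1, 2, 3, 4, 5, 6), 3);
        System.out.println(rdd.map(x -> x * 2).collect());
    }
}
```

Immutability is what makes the model easy to reason about: a lost partition can simply be recomputed from its parent, which is the basis of Spark's fault tolerance.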
◮ Many applications must process large streams of live data and provide results in real-time.
◮ Processing information as it flows, without storing it persistently.
◮ Traditional DBMSs are not designed for this.
◮ DBMS: persistent data where updates are relatively infrequent.
◮ DSMS: transient data that is continuously updated.
◮ DBMS: runs queries just once to return a complete answer.
◮ DSMS: executes standing queries, which run continuously and provide updated answers as new data arrives.
◮ Despite these differences, DSMSs resemble DBMSs: both process incoming data through a sequence of transformations based on SQL operators.
◮ Source: produces the incoming information flows
◮ Sink: consumes the results of processing
◮ IFP engine: processes incoming flows
◮ Processing rules: how to process the incoming flows
◮ Rule manager: adds/removes processing rules
◮ Large graphs need large-scale processing.
◮ A large graph either cannot fit into the memory of a single computer, or fits only at huge cost.
◮ The platforms that have worked well for developing parallel applications are not necessarily effective for large-scale graph problems.
◮ Why?
◮ Unstructured problems: difficult to partition the data
◮ Data-driven computations: difficult to partition computation
◮ Poor data locality
◮ High data access to computation ratio
Graph-Parallel Processing
◮ Computation typically depends on the neighbors.
◮ Restricts the types of computation.
◮ New techniques to partition and distribute graphs.
◮ Exploit graph structure.
◮ Executes graph algorithms orders-of-magnitude faster than more general data-parallel systems.
◮ Think like a vertex.
◮ Each vertex computes its value individually, in parallel.
◮ Each vertex can see its local context, and updates its value accordingly.
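The vertex-centric model can be sketched with a classic toy problem: every vertex adopts the largest value among itself and its neighbors, superstep after superstep, until nothing changes. This is an illustration in the spirit of Pregel-style systems, not any system's API; `VertexMax` and its method names are invented.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of vertex-centric ("think like a vertex") computation: in each
// superstep every vertex looks only at its neighbors' current values and
// updates its own; iteration stops when no vertex changes.
public class VertexMax {

    // adj.get(v) lists the neighbors of vertex v; values[v] is its initial value.
    public static int[] propagateMax(List<List<Integer>> adj, int[] values) {
        int[] current = values.clone();
        boolean changed = true;
        while (changed) {                            // one superstep per iteration
            changed = false;
            int[] next = current.clone();
            for (int v = 0; v < adj.size(); v++)     // conceptually: all vertices in parallel
                for (int u : adj.get(v))
                    if (current[u] > next[v]) {
                        next[v] = current[u];        // adopt a larger neighbor value
                        changed = true;
                    }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // A path graph 0 - 1 - 2 - 3 with initial values 3, 6, 2, 1.
        List<List<Integer>> adj = List.of(List.of(1), List.of(0, 2),
                                          List.of(1, 3), List.of(2));
        System.out.println(Arrays.toString(propagateMax(adj, new int[] {3, 6, 2, 1})));
        // every vertex converges to the global maximum, 6
    }
}
```

Each vertex uses only its local context, yet a global property (the maximum) emerges after a few supersteps; this locality is what lets graph-parallel systems partition the work across machines.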
◮ Scale-out vs. Scale-up
◮ How to store data?
◮ How to process data?