

CS 555: Distributed Systems [Fall 2019]
[MapReduce]
Shrideep Pallickara, Dept. of Computer Science, Colorado State University


  1. CS 555: Distributed Systems [MapReduce]
     Shrideep Pallickara, Computer Science, Colorado State University
     September 26, 2019
     Slides created by: Shrideep Pallickara

     Frequently asked questions from the previous class survey

  2. Topics covered in this lecture
     - MapReduce

     MapReduce

  3. MapReduce: Topics that we will cover
     - Why?
     - What it is and what it is not
     - The core framework and the original Google paper
     - Development of simple programs using Hadoop
       - The dominant MapReduce implementation

     MapReduce
     - It's a framework for processing data residing on a large number of computers
     - Very powerful framework
       - Excellent for some problems
       - Challenging or not applicable for other classes of problems

  4. What is MapReduce?
     - More a framework than a tool
     - You are required to fit (some folks shoehorn) your solution into the MapReduce framework
     - MapReduce is not a feature, but rather a constraint

     What does this constraint mean?
     - It makes problem solving both easier and harder
     - Clear boundaries for what you can and cannot do
       - You actually need to consider fewer options than you are used to
     - But solving problems with constraints requires planning and a change in your thinking

  5. But what does this get us?
     - Tradeoff of being confined to the MapReduce framework?
       - Ability to process data on a large number of computers
       - But, more importantly, without having to worry about concurrency, scale, fault tolerance, and robustness

     A challenge in writing MapReduce programs
     - Design!
       - Good programmers can produce bad software due to poor design
       - Good programmers can produce bad MapReduce algorithms
     - Only in this case your mistakes will be amplified
       - Your job may be distributed across 100s or 1000s of machines and operate on a petabyte of data

  6. MapReduce: Origins of the design
     - Process crawled data and logs of web requests
     - Several computations work on this raw data to compute derived data
       - Inverted indices
       - Representation of the graph structure of web documents
       - Pages crawled per host
       - Most frequent queries in a day …

     Most computations are conceptually straightforward
     - But the data is large
     - Computations must be scalable
       - Distributed across thousands of machines
       - To complete in a reasonable amount of time

  7. Complexity of managing distributed computations can …
     - Obscure the simplicity of the original computation
     - Contributing factors:
       ① How to parallelize the computation
       ② How to distribute the data
       ③ How to handle failures

     MapReduce was developed to cope with this complexity
     - Express simple computations
     - Hide messy details of
       - Parallelization
       - Data distribution
       - Fault tolerance
       - Load balancing

  8. MapReduce
     - Programming model
     - Associated implementation for
       - Processing and generating large data sets

     Programming model
     - Computation takes a set of input key/value pairs
     - Produces a set of output key/value pairs
     - Express the computation as two functions:
       - Map
       - Reduce

  9. Map
     - Takes an input pair
     - Produces a set of intermediate key/value pairs

     MapReduce library
     - Groups all intermediate values with the same intermediate key
     - Passes them to the Reduce function
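The grouping step that the library performs between Map and Reduce can be sketched in a few lines of Python. This is a single-process illustration only, not how a distributed implementation does it; the name `group_by_key` and the sample pairs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_key(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect every intermediate value under its intermediate key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Intermediate pairs as a map phase might emit them (illustrative data).
intermediate = [("the", "1"), ("fox", "1"), ("the", "1")]
grouped = group_by_key(intermediate)
# Each key now maps to the list of values that would be handed to Reduce.
```

Each entry of `grouped` corresponds to one call of the Reduce function: the key, plus all values that were emitted for it.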

  10. Reduce function
     - Accepts an intermediate key I and
       - A set of values for that key
     - Merges these values together to get
       - A smaller set of values

     Counting the number of occurrences of each word in a large collection of documents

         map(String key, String value):
             // key: document name
             // value: document contents
             for each word w in value:
                 EmitIntermediate(w, "1");

  11. Counting the number of occurrences of each word in a large collection of documents

         reduce(String key, Iterator values):
             // key: a word
             // values: a list of counts
             int result = 0;
             for each v in values:
                 result += ParseInt(v);
             Emit(AsString(result));

     Sums together all counts emitted for a particular word

     MapReduce specification object contains
     - Names of
       - Input
       - Output
     - Tuning parameters
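To see the word-count map and reduce functions work together, here is a minimal single-process sketch in Python. It assumes whitespace-separated words and runs everything in memory; `map_fn`, `reduce_fn`, `run_word_count`, and the sample documents are hypothetical names for illustration, not part of Hadoop or of the original Google library.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, str]]:
    # key: document name (unused here); value: document contents
    for word in value.split():
        yield (word, "1")


def reduce_fn(key: str, values: List[str]) -> str:
    # key: a word; values: the counts emitted for that word, as strings
    return str(sum(int(v) for v in values))


def run_word_count(documents: Dict[str, str]) -> Dict[str, str]:
    # Map phase plus grouping: gather all counts emitted for each word.
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    # Reduce phase: one call per distinct word.
    return {word: reduce_fn(word, counts) for word, counts in groups.items()}


docs = {"a.txt": "to be or not to be", "b.txt": "to do"}
counts = run_word_count(docs)
```

In a real deployment the map calls, the grouping, and the reduce calls would run on different machines; the sketch only shows the data flow between the two user-defined functions.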

  12. Map and reduce functions have associated types drawn from different domains

         map    (k1, v1)       → list(k2, v2)
         reduce (k2, list(v2)) → list(v2)

     What's passed to and from user-defined functions
     - Strings
     - User code converts between
       - String
       - Appropriate types
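The type signatures above, and the string conversion that user code is responsible for, might be written in Python roughly as follows. The aliases `MapFn` and `ReduceFn` and the function `max_reduce` are hypothetical, chosen only to illustrate the idea.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map: (k1, v1) -> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: (k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, List[V2]], List[V2]]


def max_reduce(key: str, values: List[str]) -> List[str]:
    """Reduce that receives strings, computes on floats, and returns strings."""
    numbers = [float(v) for v in values]  # String -> appropriate type
    return [str(max(numbers))]            # appropriate type -> String


result = max_reduce("sensor-1", ["3.5", "10.25", "7.0"])
```

Note that the input keys and values (k1, v1) come from a different domain than the output, while the intermediate keys and values (k2, v2) share a domain with the output values.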
