MapReduce Introduction and Hadoop Overview

13 June 2012
MapReduce Introduction and Hadoop Overview
Lab Course: Databases & Cloud Computing, SS 2012
Martin Przyjaciel-Zablocki, Alexander Schätzle, Georg Lausen
University of Freiburg, Databases & Information Systems

Agenda
1. Why MapReduce?
   a. Data Management Approaches
   b. Distributed Processing of Data
2. Apache Hadoop
   a. HDFS
   b. MapReduce
   c. Pig
   d. Hive
   e. HBase
3. Programming MapReduce
   a. MapReduce with Java
   b. Moving into the Cloud

MapReduce: Why?
• Large datasets
  ◦ Amount of data increases constantly
  ◦ "Five exabytes of data are generated every two days" (corresponds to the whole amount of data generated up to 2003), Eric Schmidt
• Facebook:
  ◦ >800 million active users, interacting with >1 billion objects
  ◦ 2.5 petabytes of user data per day!
• How to explore and analyze such large datasets?

MapReduce: Why?
• Processing a 100 TB dataset
• On 1 node
  ◦ Scanning @ 50 MB/s = 23 days
• On a 1000-node cluster
  ◦ Scanning @ 50 MB/s = 33 min
• Current development
  ◦ Companies often can't cope with logged user behavior and throw away data after some time → lost opportunities
  ◦ Growing cloud-computing capacities
  ◦ Price/performance advantage of low-end servers increases to about a factor of twelve
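The scan times on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even split of the data across nodes with no coordination overhead; the function name `scan_seconds` is made up for illustration.

```python
# Back-of-the-envelope scan times for the 100 TB example.
# Assumptions: decimal units and ideal, overhead-free parallel scanning.

TB = 10**12  # bytes
MB = 10**6   # bytes

def scan_seconds(dataset_bytes, rate_bytes_per_s, nodes=1):
    """Time to scan the dataset when every node reads its share at the given rate."""
    return dataset_bytes / (rate_bytes_per_s * nodes)

single = scan_seconds(100 * TB, 50 * MB)               # one machine
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)  # 1000 machines

print(f"1 node:     {single / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Under these assumptions the single node needs about 23 days and the cluster about 33 minutes, matching the slide. Real clusters will be slower than the ideal figure because of scheduling, stragglers, and network transfer.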

Data Management Approaches
• High-performance single machines
  ◦ "Scale-up" with limits (hardware, software, costs)
  ◦ Workloads today are beyond the capacity of any single machine
  ◦ I/O bottleneck
• Parallel databases
  ◦ Fast and reliable
  ◦ "Scale-out" restricted to some hundreds of machines
  ◦ Maintaining and administering parallel databases is hard
• Specialized cluster of powerful machines
  ◦ "Specialized" = powerful hardware satisfying individual software needs
  ◦ Fast and reliable but also very expensive
  ◦ For data-intensive applications: scaling "out" is superior to scaling "up" → performance gap insufficient to justify the price

Data Management Approaches (2)
• Clusters of commodity servers (with MapReduce)
• "Commodity servers" = not individually adjusted
  ◦ e.g. 8 cores, 16 GB of RAM
  ◦ Cost & energy efficiency
• MapReduce
  ◦ Designed around clusters of commodity servers
  ◦ Widely used in a broad range of applications
  ◦ By many organizations
  ◦ Scaling "out", e.g. Yahoo! uses > 40,000 machines
  ◦ Easy to maintain & administer

Distributed Processing of Data
• Problem: How to compute the PageRank for a crawled set of websites on a cluster of machines? MapReduce!
• Main challenges:
  ◦ How to break up a large problem into smaller tasks that can be executed in parallel?
  ◦ How to assign tasks to machines?
  ◦ How to partition and distribute data?
  ◦ How to share intermediate results?
  ◦ How to coordinate synchronization, scheduling, fault tolerance?

Big ideas behind MapReduce
• Scale "out", not "up"
  ◦ Large number of commodity servers
• Assume failures are common
  ◦ In a cluster of 10,000 servers, expect 10 failures a day
• Move processing to the data
  ◦ Take advantage of data locality and avoid transferring large datasets through the network
• Process data sequentially and avoid random access
  ◦ Random disk access incurs seek time on every access
• Hide system-level details from the application developer
  ◦ Developers can focus on their problems instead of dealing with distributed programming issues
• Seamless scalability
  ◦ Scaling "out" improves the performance of an algorithm without any modifications

MapReduce
• MapReduce
  ◦ Popularized by Google & widely used
  ◦ Algorithms that can be expressed as (or mapped to) a sequence of Map() and Reduce() functions are automatically parallelized by the framework
• Distributed File System
  ◦ Data is split into equally sized blocks and stored distributed
  ◦ Clusters of commodity hardware → fault tolerance by replication
  ◦ Very large files / write-once, read-many pattern
• Advantages
  ◦ Partitioning + distribution of data
  ◦ Parallelization and assignment of tasks
  ◦ Scalability, fault tolerance, scheduling, ...
  All of this is done automatically!

What is MapReduce used for?
• At Google
  ◦ Index construction for Google Search (replaced in 2010 by Caffeine)
  ◦ Article clustering for Google News
  ◦ Statistical machine translation
• At Yahoo!
  ◦ "Web map" powering Yahoo! Search
  ◦ Spam detection for Yahoo! Mail
• At Facebook
  ◦ Data mining, Web log processing
  ◦ SearchBox (with Cassandra)
  ◦ Facebook Messages (with HBase)
  ◦ Ad optimization
  ◦ Spam detection

What is MapReduce used for?
• In research
  ◦ Astronomical image analysis (Washington)
  ◦ Bioinformatics (Maryland)
  ◦ Analyzing Wikipedia conflicts (PARC)
  ◦ Natural language processing (CMU)
  ◦ Particle physics (Nebraska)
  ◦ Ocean climate simulation (Washington)
  ◦ Processing of Semantic Data (Freiburg)
  ◦ <Your application here>


2. Apache Hadoop
"Open-source software for reliable, scalable, distributed computing"

Apache Hadoop: Why?
• Apache Hadoop
  ◦ Well-known open-source implementation of Google's MapReduce & Google File System (GFS) papers
  ◦ Enriched by many subprojects
  ◦ Used by Yahoo, Facebook, Amazon, IBM, Last.fm, eBay, ...
  ◦ Cloudera's Distribution with VMware images, tutorials and further patches

Hadoop Ecosystem
• Pig (Data Flow)
• Hive (SQL)
• Chukwa (Managing)
• ZooKeeper (Coordination)
• Avro (Serialization)
• MapReduce (Job Scheduling/Execution System)
• HBase (NoSQL)
• HDFS (Hadoop Distributed File System)
• Hadoop Common (supporting utilities, libraries)

Yahoo's Hadoop Cluster
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf

Hadoop Distributed File System
• Files split into 64 MB blocks
• Blocks replicated across several DataNodes (usually 3)
• Single NameNode stores metadata
  ◦ File names, block locations, etc.
• Optimized for large files, sequential reads
• Files are append-only
[Figure: the NameNode records File1 as blocks 1 to 4; each block is stored on three of the DataNodes]
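The block splitting and replication described on this slide can be illustrated with a toy model. This is a minimal sketch, not HDFS code: the function names and the round-robin placement are assumptions made for illustration, and the real HDFS placement policy is more involved (for example, it takes racks into account).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks, as stated on the slide
REPLICATION = 3                # "usually 3" replicas per block

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    count = -(-file_size // block_size)  # integer ceiling division
    sizes = [block_size] * count
    if count and file_size % block_size:
        sizes[-1] = file_size % block_size  # the last block may be shorter
    return sizes

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Hypothetical round-robin placement: each block lands on
    `replication` distinct DataNodes."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

# A 200 MB file becomes three full blocks plus one 8 MB block.
sizes = split_into_blocks(200 * 1024 * 1024)
# This dictionary plays the role of the NameNode's block-location metadata.
placement = place_replicas(len(sizes), ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in placement.items():
    print(block, sizes[block], nodes)
```

Because every block lives on three different nodes in this model, losing any single DataNode still leaves two copies of each block it held.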

Hadoop Architecture
• Master & Slaves architecture
• JobTracker schedules and manages jobs
• TaskTracker executes individual map() and reduce() tasks on each cluster node
• JobTracker and NameNode as well as TaskTrackers and DataNodes are placed on the same machines
[Figure: Master = JobTracker + NameNode; Slaves = three nodes, each running TaskTracker + DataNode]
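Placing TaskTrackers on the same machines as DataNodes is what lets the scheduler run a task where its input block already resides. The sketch below is a simplified illustration of that idea only; the function `assign_tasks`, its data structures, and the fallback rule are assumptions and do not reproduce the actual JobTracker scheduling logic.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: dict of block id -> list of nodes holding a replica
    free_slots:      dict of node -> number of free task slots
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the input stays untouched
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            candidates = [node for node, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            # Assumed fallback: pick the node with the most free slots;
            # the block must then be read over the network.
            node, is_local = max(candidates, key=lambda n: slots[n]), False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment

plan = assign_tasks(
    {1: ["a", "b"], 2: ["a"], 3: ["a"]},
    {"a": 2, "b": 1, "c": 1},
)
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

In this run, blocks 1 and 2 are processed locally on node "a"; once its slots are used up, block 3 falls back to a remote node.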

MapReduce Workflow
(1) Map Phase
  ◦ Raw data is read and converted to key/value pairs
  ◦ Map() function is applied to every pair
(2) Shuffle Phase
  ◦ All key/value pairs are sorted and grouped by their keys
(3) Reduce Phase
  ◦ All values with the same key are processed within the same reduce() call
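The three phases can be traced with a single-process word count. This is a minimal sketch in plain Python rather than Hadoop code: the names `map_fn`, `reduce_fn`, and `run_job` are made up for illustration, and everything runs sequentially in memory, whereas the framework would distribute each phase across the cluster.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all counts that share the same key."""
    return word, sum(counts)

def run_job(lines):
    # (1) Map phase: raw lines become (line number, line) pairs,
    #     and map_fn is applied to every pair.
    intermediate = []
    for number, line in enumerate(lines):
        intermediate.extend(map_fn(number, line))

    # (2) Shuffle phase: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # (3) Reduce phase: one reduce_fn call per distinct key, in key order.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]

print(run_job(["to be or not to be", "to do"]))
# prints [('be', 2), ('do', 1), ('not', 1), ('or', 1), ('to', 3)]
```

Since each reduce call only sees the values of one key, the calls are independent of one another, which is what allows a framework to run them in parallel on different nodes.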
