
Hadoop Performance Evaluation
Advanced practical course (Praktikum für Fortgeschrittene)
Name: Tien Duc Dinh
Supervisors: Olga Mordvinova, Julian Kunkel
Date: 04-12-2007

Outline
1. Introduction
   - Motivation
   - Basic notations
2. HDFS Overview
   - Architecture
   - MapReduce
3. HDFS Performance
   - Test scenarios
   - Write
   - Read
   - Comparison with local FS

What is Hadoop?
- Hadoop is an open-source, Java-based programming framework (an Apache project)
- supports the processing of large data sets in a distributed computing environment
- was inspired by Google MapReduce and the Google File System (GFS)
- is currently used by many well-known IT enterprises, e.g. Google, Yahoo, IBM

Basic notations
HDFS = Hadoop Distributed File System
Distributed file system
- contains mechanisms for job scheduling/execution
- for instance, allows jobs to be moved to the data
Job/Task = MapReduce job/task
Metadata
- data that describes other data
- e.g. file name, block location
Block
- part of a logical file
- contiguous data stored on one server
- 64 MB by default
- configurable
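The block notion above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that a file is cut into full blocks of the configured size and that the last block holds only the remaining bytes; the function name `split_into_blocks` is hypothetical and not part of any Hadoop API.

```python
MB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 64 * MB  # default from the slide; configurable


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks that make up a file.

    Assumption for this sketch: all blocks are full except possibly the last.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# A 200 MB file occupies three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * MB)
```

With a larger block size the same file needs fewer blocks, so the Namenode has fewer block locations to track.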

HDFS Overview
[Figure: HDFS architecture. The Client submits a job to the Queue, from which the JobTracker gets jobs. The Client sends metadata requests to the Namenode and receives metadata responses; the Namenode holds the Metadata and is accompanied by a Secondary Namenode. The Client sends op-requests to the Datanodes and receives op-responses. Each worker node runs a Datanode and a TaskTracker on top of its local Filesystem.]

Client
- is the API used by an HDFS application
- communicates with the Namenode to obtain metadata and runs the operation directly on the Datanodes
- if it is a MapReduce operation, the client creates a job and sends it to the queue; the JobTracker handles this queue

Namenode
- is the master server which manages all system metadata, such as the namespace, access control information, the mapping from files to chunks, and the chunk locations
- executes file system namespace operations like opening, closing, and renaming files and directories
- gives instructions to the Datanodes to perform system operations, e.g. block creation, deletion and replication
- having only one Namenode simplifies the design
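The metadata the Namenode keeps can be pictured as two lookup tables: one from file paths to blocks (the slide's chunks), and one from blocks to the Datanodes holding them. The following is a toy sketch under that assumption; the class `ToyNamenode` and all of its method names are hypothetical and do not mirror Hadoop's actual classes.

```python
class ToyNamenode:
    """In-memory model of Namenode metadata (illustration only)."""

    def __init__(self):
        self.files = {}      # path -> ordered list of block ids
        self.locations = {}  # block id -> set of Datanode names

    def create(self, path, block_ids):
        # Namespace operation: register a new file and its blocks.
        if path in self.files:
            raise FileExistsError(path)
        self.files[path] = list(block_ids)

    def rename(self, old_path, new_path):
        # Namespace operation: only metadata changes, no data is moved.
        self.files[new_path] = self.files.pop(old_path)

    def report_block(self, block_id, datanode):
        # Record that a Datanode stores a copy of the block.
        self.locations.setdefault(block_id, set()).add(datanode)

    def get_block_locations(self, path):
        # What a client needs before it contacts the Datanodes directly.
        return [(b, sorted(self.locations.get(b, ()))) for b in self.files[path]]


namenode = ToyNamenode()
namenode.create("/data/a", ["blk_1", "blk_2"])
namenode.report_block("blk_1", "dn1")
namenode.report_block("blk_1", "dn2")
namenode.report_block("blk_2", "dn2")
namenode.rename("/data/a", "/data/b")
layout = namenode.get_block_locations("/data/b")
```

The rename touches only the `files` table, which illustrates why namespace operations are handled by the Namenode alone while the data itself stays on the Datanodes.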

Datanode
- one per node
- stores HDFS data in its local file system
- performs operations requested by clients and system operations upon instruction from the Namenode

Secondary Namenode
- modifications to the file system are stored by the Namenode in a log file
- while starting up, the Namenode reads the HDFS state from an image file (fsimage) and then applies the modifications from the log file
- after the Namenode has finished writing the new HDFS state to the image file, it empties the log file
- the Secondary Namenode periodically merges fsimage and the log file and keeps the log size within a limit
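The interplay of image file and log file described above can be sketched in a few lines. This is a simplified model, assuming the namespace is just a set of paths and the log holds three kinds of edits; the names `load_state`, `checkpoint`, and the edit tuples are hypothetical and unrelated to Hadoop's on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged modification to the in-memory namespace."""
    op = edit[0]
    if op == "create":
        namespace.add(edit[1])
    elif op == "delete":
        namespace.discard(edit[1])
    elif op == "rename":
        namespace.discard(edit[1])
        namespace.add(edit[2])
    else:
        raise ValueError(f"unknown edit: {op}")


def load_state(fsimage, edit_log):
    """Startup: read the image, then replay the log on top of it."""
    namespace = set(fsimage)  # copy, so the stored image is not modified
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


def checkpoint(fsimage, edit_log):
    """Merge image and log into a new image and return it with an empty log."""
    return load_state(fsimage, edit_log), []


fsimage = {"/a"}
edit_log = [("create", "/b"), ("rename", "/a", "/c"), ("delete", "/b")]
new_image, new_log = checkpoint(fsimage, edit_log)
```

Loading the new image with the empty log gives the same state as loading the old image with the full log, but startup no longer has to replay every edit.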

TaskTracker
- is a node in the cluster that accepts MapReduce tasks from the JobTracker
- is configured with a set of slots; these indicate the number of tasks that it can accept
- spawns separate JVM processes to do the actual work; this helps to ensure that a process failure does not take down the TaskTracker
- monitors the processes and reports their state to the JobTracker
- contacts the JobTracker through heartbeat messages

JobTracker (1)
- is the MapReduce master
- normally runs on a separate node
- uses a queue for the IO scheduling
- talks to the Namenode to determine the location of the data
- submits the work to the chosen TaskTracker nodes and monitors them through heartbeat messages sent at regular time intervals

JobTracker (2)
- if a task fails, it may be resubmitted elsewhere
- when the work is completed, the JobTracker updates its status
- client applications can poll the JobTracker for information
- the JobTracker is a single point of failure for the Map/Reduce infrastructure. If it goes down, all running jobs are lost. The filesystem remains live
- there is currently no checkpointing or recovery within a single map/reduce job
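Slots, heartbeats, and resubmission of failed work can be combined in one toy model. The sketch below assumes that a TaskTracker counts as failed once its last heartbeat is older than a fixed timeout, and that tasks go to the tracker with the most free slots; both rules, the class `ToyJobTracker`, and its methods are assumptions for illustration, not Hadoop's actual scheduler.

```python
class ToyJobTracker:
    """Tracks heartbeats and task assignments (illustration only)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # tracker name -> time of last heartbeat
        self.slots = {}      # tracker name -> number of task slots
        self.assigned = {}   # tracker name -> list of running tasks

    def heartbeat(self, tracker, now, slots):
        # A heartbeat registers the tracker or refreshes its timestamp.
        self.last_seen[tracker] = now
        self.slots[tracker] = slots
        self.assigned.setdefault(tracker, [])

    def submit(self, task):
        """Assign the task to the tracker with the most free slots, or None."""
        free = {t: self.slots[t] - len(tasks) for t, tasks in self.assigned.items()}
        candidates = [t for t, n in free.items() if n > 0]
        if not candidates:
            return None
        best = max(candidates, key=free.get)
        self.assigned[best].append(task)
        return best

    def expire(self, now):
        """Drop trackers with overdue heartbeats and resubmit their tasks."""
        dead = [t for t, seen in self.last_seen.items() if now - seen > self.timeout]
        orphaned = []
        for tracker in dead:
            orphaned.extend(self.assigned.pop(tracker))
            del self.last_seen[tracker]
            del self.slots[tracker]
        return {task: self.submit(task) for task in orphaned}


tracker = ToyJobTracker(timeout=10)
tracker.heartbeat("tt1", now=0, slots=2)
tracker.heartbeat("tt2", now=0, slots=2)
first_home = tracker.submit("map-0")
tracker.heartbeat("tt2", now=8, slots=2)  # tt1 stays silent
moved = tracker.expire(now=12)
```

All of this state lives in one process, which is the single point of failure the slide describes: if that process is lost, so is the record of every running task.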

MapReduce (1)
- is a programming model and an associated implementation for processing and generating large data sets
- its functions map and reduce are supplied by the user
Map
- process a key/value pair to generate a set of intermediate key/value pairs
- group together all intermediate values with the same key and pass them to the Reducer
Reduce
- XXXXXXXXXXXXXXX
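The division of labour in this model, user-supplied map and reduce functions around a framework-provided grouping step, can be sketched as a tiny single-process driver. This is a minimal illustration, not how Hadoop executes jobs; `run_mapreduce` and the two example functions are hypothetical names.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group intermediate values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for intermediate_key, intermediate_value in map_fn(key, value):
            groups[intermediate_key].append(intermediate_value)
    return {k: reduce_fn(k, values) for k, values in sorted(groups.items())}


def map_word_lengths(key, value):
    # Emit (first letter, word length) for every word in the document.
    for word in value.split():
        yield word[0], len(word)


def reduce_longest(key, values):
    # Keep the largest length seen for this first letter.
    return max(values)


records = [("doc1", "apple avocado"), ("doc2", "banana blueberry apple")]
longest = run_mapreduce(records, map_word_lengths, reduce_longest)
# longest == {"a": 7, "b": 9}
```

Swapping in different `map_fn` and `reduce_fn` arguments changes the computation without touching the driver, which is the point of leaving these two functions to the user.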

MapReduce (2)

MapReduce (3)

Example: Counting word occurrences (1)

map(String key, String value):
    // key: document name (usually key isn't used)
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Example: Counting word occurrences (2)
The folder "data" contains 2 files, a and b, with the following contents:
- a: Hello World Bye World
- b: Hello Hadoop Goodbye Hadoop
The following command solves this problem:
> perl -p -e 's/\s+/\n/g' data/* | sort | uniq -c
The output looks like:
1 Bye
1 Goodbye
2 Hadoop
2 Hello
2 World
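The same counts can be reproduced by following the map and reduce pseudocode from the previous slide. The sketch below is a single-process illustration, assuming the two files are held as in-memory strings and that grouping is done by sorting the intermediate pairs, much like the `sort | uniq -c` pipeline; the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_words(key, value):
    # key: document name (unused), value: document contents
    for word in value.split():
        yield word, "1"


def reduce_counts(key, values):
    # key: a word, values: an iterable of counts as strings
    result = 0
    for v in values:
        result += int(v)
    return str(result)


def word_count(documents):
    """Map every document, sort the pairs by word, reduce each group."""
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_words(name, contents))
    intermediate.sort(key=itemgetter(0))
    return {
        word: reduce_counts(word, (count for _, count in pairs))
        for word, pairs in groupby(intermediate, key=itemgetter(0))
    }


documents = {
    "a": "Hello World Bye World",
    "b": "Hello Hadoop Goodbye Hadoop",
}
counts = word_count(documents)
for word, count in counts.items():
    print(count, word)
```

The counts are strings because the pseudocode emits `AsString(result)`; the printed lines match the output of the perl pipeline above.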
