Outline Motivation and Overview of Hadoop Architecture, Design - PowerPoint PPT Presentation

THE HADOOP DISTRIBUTED FILE SYSTEM
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler
Presented by Alexander Pokluda
October 7, 2013

Outline
• Motivation and Overview of Hadoop
• Architecture, Design & Implementation of the Hadoop Distributed File System (HDFS)
  – Comparison with Google File System (GFS)
• Performance Benchmarks
• Conclusion

MOTIVATION AND OVERVIEW

Motivation and Overview
[Diagram: the Hadoop stack (Pig, Hive, Sqoop, HBase, MapReduce, Hadoop Distributed File System) set against Google's stack (MapReduce, Google File System), with the remaining Google counterparts marked "??"]
• In the early 2000s, Google developed the "Google File System" to support large distributed data-intensive applications
• Shortly after, they developed "MapReduce" to allow developers to easily carry out large-scale parallel computations
• Examples: processing crawled documents, web request logs, etc. to produce inverted indices, statistics, etc.
• Hadoop is an open source implementation of Google's proprietary MapReduce framework; HDFS is the file system component of Hadoop
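To make the inverted-index example concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps. It is not Hadoop's API; the function names and the two sample documents are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Collapse the document ids for one word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def inverted_index(docs):
    pairs = (pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical input documents.
docs = {"d1": "hadoop stores large files", "d2": "large files stream fast"}
index = inverted_index(docs)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network; the sketch only shows the data flow.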

ARCHITECTURE, DESIGN AND IMPLEMENTATION

HDFS Architecture
NameNode
• Maintains namespace hierarchy and file system metadata such as block locations
• Namespace and metadata are stored in RAM but periodically flushed to disk
• Modification log keeps on-disk image up to date
DataNodes
• Store HDFS file data in the local file system
• Receive commands from the NameNode that instruct them to:
  – Replicate blocks to other nodes
  – Remove local block replicas
  – Re-register or shut down
  – Send an immediate block report
HDFS Client
• Code library that exports the HDFS file system interface to applications
• Reads data by transferring it from a DataNode directly
• Writes data by setting up a node-to-node pipeline and sending data to the first DataNode
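The command flow can be sketched as follows, assuming the NameNode queues commands per DataNode and hands them back in its reply to that node's next heartbeat. The class and method names are hypothetical, and only the REMOVE command is acted on here.

```python
from enum import Enum, auto


class Command(Enum):
    REPLICATE = auto()
    REMOVE = auto()
    REREGISTER = auto()
    SHUTDOWN = auto()
    SEND_BLOCK_REPORT = auto()


class NameNode:
    def __init__(self):
        self.pending = {}  # DataNode id -> list of (Command, argument)

    def queue(self, node_id, command, arg=None):
        self.pending.setdefault(node_id, []).append((command, arg))

    def heartbeat(self, node_id):
        """Reply to a heartbeat with the commands queued for that node."""
        return self.pending.pop(node_id, [])


class DataNode:
    def __init__(self, node_id, blocks):
        self.node_id = node_id
        self.blocks = set(blocks)
        self.log = []  # commands received, in order

    def send_heartbeat(self, namenode):
        for command, arg in namenode.heartbeat(self.node_id):
            if command is Command.REMOVE:
                self.blocks.discard(arg)  # drop the local replica
            self.log.append(command)


nn = NameNode()
dn = DataNode("dn1", {1, 2})
nn.queue("dn1", Command.REMOVE, 2)
nn.queue("dn1", Command.SEND_BLOCK_REPORT)
dn.send_heartbeat(nn)
```

The point of the design is that the NameNode never opens a connection to a DataNode; it only answers heartbeats.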

Redundancy Mechanisms
Image and Journal
• An image is the file system metadata that describes the organization of application data as directories and files
• A persistent record of it written to disk is called a checkpoint
• The journal is a write-ahead commit log for changes that must be persistent
CheckpointNode and BackupNode
• A NameNode can alternatively be run as a CheckpointNode or BackupNode
• The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal
• A BackupNode acts like a shadow of the NameNode and keeps an up-to-date copy of the image in memory
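The checkpoint-plus-journal idea can be sketched in a few lines, assuming the image is a plain dictionary from path to metadata and the journal is a list of operation records. The record layout and the three operation kinds are assumptions for illustration, not the HDFS on-disk format.

```python
def apply(image, op):
    """Apply one journal record to an in-memory image (dict of path -> metadata)."""
    kind, path, *rest = op
    if kind == "create":
        image[path] = rest[0]
    elif kind == "delete":
        image.pop(path, None)
    elif kind == "rename":
        image[rest[0]] = image.pop(path)
    return image


def make_checkpoint(checkpoint, journal):
    """Replay the journal over the old checkpoint; return (new checkpoint, empty journal)."""
    image = dict(checkpoint)  # leave the old checkpoint untouched
    for op in journal:
        apply(image, op)
    return image, []


# Hypothetical state before the CheckpointNode runs.
checkpoint = {"/a": {"replication": 3}}
journal = [
    ("create", "/b", {"replication": 2}),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
new_checkpoint, new_journal = make_checkpoint(checkpoint, journal)
```

Replaying the journal at restart gives the same result as loading the new checkpoint, which is why a shorter journal means a faster restart.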

File I/O Operations and Replica Management
File Read and Write
• An application adds data to HDFS by creating a new file and writing data to it
• All files are read and append only
• HDFS implements a single-writer, multiple-reader model
Data Streaming
• When there is need for a new block, the NameNode allocates a new block ID and determines a list of DataNodes to host replicas of the block
• Data is sent to the DataNodes in a pipeline fashion
• Data may not be visible to readers until the file is closed
Block Placement
• Default strategy ensures:
  – No DataNode contains more than one replica of any block
  – No rack contains more than two replicas of the same block
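The two block placement rules can be expressed as a small check. This is a sketch of the constraints only, not of how the NameNode chooses targets; the node names and rack map are hypothetical.

```python
from collections import Counter


def placement_ok(replica_nodes, rack_of):
    """Check the two default placement rules for one block.

    replica_nodes: list of DataNode ids holding a replica of the block.
    rack_of: mapping from DataNode id to rack id.
    """
    one_per_node = len(set(replica_nodes)) == len(replica_nodes)
    per_rack = Counter(rack_of[node] for node in replica_nodes)
    at_most_two_per_rack = all(count <= 2 for count in per_rack.values())
    return one_per_node and at_most_two_per_rack


# Hypothetical cluster: three nodes in rack A, one in rack B.
rack_of = {"dn1": "A", "dn2": "A", "dn3": "A", "dn4": "B"}
spread = placement_ok(["dn1", "dn2", "dn4"], rack_of)       # two racks used
single_rack = placement_ok(["dn1", "dn2", "dn3"], rack_of)  # three replicas in rack A
same_node = placement_ok(["dn1", "dn1", "dn4"], rack_of)    # two replicas on dn1
```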

File Write Operation
[Figure: file write operation. Source: The Hadoop Distributed File System]
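A simplified sketch of the write pipeline: the client sends each packet to the first DataNode only, and each node stores the packet and forwards it downstream. Acknowledgements, checksums and failure handling are left out, and the packet size and names are assumptions for illustration.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytearray of received data

    def receive(self, block_id, packet, downstream):
        """Store the packet locally, then forward it to the next node in the pipeline."""
        self.blocks.setdefault(block_id, bytearray()).extend(packet)
        if downstream:
            downstream[0].receive(block_id, packet, downstream[1:])


def write_block(block_id, data, pipeline, packet_size=4):
    """Client side: split the data into packets and send each to the first DataNode."""
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        pipeline[0].receive(block_id, packet, pipeline[1:])


nodes = [DataNode(name) for name in ("dn1", "dn2", "dn3")]
write_block(7, b"hello hdfs", nodes)
```

Because the client uploads each byte once, its outbound bandwidth does not grow with the replication factor.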

Data Replication
NameNode
• /users/apokluda/log, r:2, {1, 3}, …
• /users/apokluda/data, r:3, {2, 4, 5}, …
DataNodes
[Diagram: replicas of blocks 1 to 5 distributed across DataNodes in Rack A and Rack B]
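The bookkeeping behind this slide can be sketched as a comparison between each file's target replication and the replicas that DataNodes report. The two file entries come from the slide; the per-node block contents below are hypothetical, since the diagram layout did not survive extraction.

```python
from collections import Counter


def under_replicated(files, block_reports):
    """Return {block_id: missing replica count} for blocks below their target.

    files: path -> (target replication, list of block ids), as tracked by the NameNode.
    block_reports: DataNode id -> set of block ids that node reports holding.
    """
    live = Counter(b for blocks in block_reports.values() for b in blocks)
    missing = {}
    for replication, block_ids in files.values():
        for block_id in block_ids:
            shortfall = replication - live[block_id]
            if shortfall > 0:
                missing[block_id] = shortfall
    return missing


files = {
    "/users/apokluda/log": (2, [1, 3]),
    "/users/apokluda/data": (3, [2, 4, 5]),
}
# Hypothetical block reports from five DataNodes.
block_reports = {
    "dn1": {1, 2},
    "dn2": {2, 5},
    "dn3": {1, 3, 4},
    "dn4": {3, 4, 5},
    "dn5": {2, 4, 5},
}
healthy = under_replicated(files, block_reports)

# Simulate the loss of one DataNode.
after_failure = {node: blocks for node, blocks in block_reports.items() if node != "dn5"}
degraded = under_replicated(files, after_failure)
```

Blocks listed in the result are the ones the NameNode would schedule for re-replication on other nodes.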

HADOOP DISTRIBUTED FILE SYSTEM VS GOOGLE FILE SYSTEM

HDFS vs GFS: Implementation
• Platform
  – HDFS: Cross-platform (Java)
  – GFS: Linux (C/C++)
• License
  – HDFS: Open source (Apache 2.0)
  – GFS: Proprietary (in-house use only)
• Developer(s)
  – HDFS: Yahoo! and open source community
  – GFS: Google
HDFS vs GFS: Architecture
• Architecture Pattern
  – Both: Single NameNode has a global view of the entire file system
• Deployment Hardware
  – Both: Commodity servers (designed to tolerate component failures)
• Inter-Node Communication
  – Both: NameNode uses heartbeats to send commands to DataNodes
• DataNode Design
  – Both: User-level server process stores blocks as files in local file system

HDFS vs GFS: File System State
• File Index State
  – Both: File index state and mapping of files to blocks kept in memory at NameNode and periodically flushed to disk; modification log records changes in between checkpoints
• Block Location State
  – HDFS: Block locations are reported to the NameNode by DataNodes in block reports and held in memory; they are not part of the persistent checkpoint
  – GFS: Block location information sent to NameNode by DataNodes on startup; not stored persistently at NameNode
• Data Integrity
  – HDFS: Checksums verified by clients
  – GFS: Checksums verified by DataNodes
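The data integrity row can be illustrated with a client-side check that recomputes per-chunk checksums and compares them with the stored ones. This sketch uses CRC32 from Python's `zlib`; the 512-byte chunk size and the function names are assumptions for illustration rather than a description of the HDFS wire format.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum; an assumed value for illustration


def checksums(data, chunk=CHUNK):
    """Compute one CRC32 per fixed-size chunk of a block."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data, expected, chunk=CHUNK):
    """Return the indices of chunks whose checksum does not match the stored one."""
    actual = checksums(data, chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


block = bytes(range(256)) * 8   # 2048 bytes, so 4 chunks
stored = checksums(block)       # written alongside the block at write time

corrupted = bytearray(block)
corrupted[600] ^= 0xFF          # damage one byte inside chunk 1
bad_chunks = verify(bytes(corrupted), stored)
```

A reader that finds a bad chunk can fetch the block from another replica and report the corrupt one.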

HDFS vs GFS: File System Operations
• Write Operations
  – HDFS: Append only
  – GFS: Random offset write; record append
• Write Consistency Guarantees
  – HDFS: Single-writer model ensures files are always defined and consistent
  – GFS: Successful concurrent writes create consistent but undefined regions; successful concurrent record appends create defined regions interspersed with inconsistent ones
• Deletion
  – Both: Deleted files renamed to a special Trash/Recycling Bin-like folder and removed lazily by garbage collection process
• Snapshots
  – HDFS: HDFS 2 allows each directory to have up to 65,536 snapshots
  – GFS: Can snapshot individual files and directories
• Block Size
  – HDFS: 128 MB default but user configurable per file
  – GFS: 64 MB default but user configurable per file

HDFS vs GFS: Use Cases
• Primary Use
  – Both: General purpose (production services, R&D) and MapReduce jobs
• Data Access Pattern
  – Both: Random access reads supported but optimized for streaming
• File Size
  – Both: Optimized for large files
• Replication
  – Both: User configurable per file, but 3 replicas stored by default
• Client API
  – Both: Custom library and command line utilities

PERFORMANCE BENCHMARKS

Performance Benchmarks
DFSIO
• Read: 66 MB/s per node
• Write: 40 MB/s per node
Production Cluster
• Read: 1.02 MB/s per node
• Write: 1.09 MB/s per node
Sort
• 1 TB sort: 22.1 MB/s per node (RW)
• 1 PB sort: 9.35 MB/s per node (RW)
Operation: Throughput (ops/s)
• Open File for Read: 126,100
• Create File: 5,600
• Rename File: 8,300
• Delete File: 20,700
• DataNode Heartbeat: 300,000
• Blocks Report (blocks/s): 639,700

CONCLUSION

Conclusion
• The Hadoop Distributed File System is designed to store very large data sets reliably and to stream these data sets to user applications at high bandwidth
• The Hadoop MapReduce framework is designed to distribute storage and computation tasks across thousands of servers so that resources can scale with demand while remaining economical at every size
• The HDFS architecture consists of a single NameNode, many DataNodes and the HDFS client
• Hadoop is an open source project that was inspired by Google's proprietary Google File System and MapReduce framework

DISCUSSION
