The Hadoop Distributed File System


  1. The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA Presented by Haoran Ma, Yifan Qiao

  2. Outline • Introduction • Architecture • File I/O Operations and Replica Management • Practice at Yahoo! • Future Work

  3. Introduction • A single dataset is too large → divide it into pieces and store them on a cluster of commodity hardware. - What if one of the physical machines fails? • Some applications, such as MapReduce, need high-throughput access to the data.

  4. Introduction • HDFS is the file system component of Hadoop. It is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. • Both goals are achieved by replicating file contents on multiple machines (DataNodes).

  5. Introduction • Very Large Distributed File System • Assumes Commodity Hardware - Files are replicated to handle hardware failure • Optimized for Batch Processing - Data locations exposed so that computations can move to where data resides

  6. Introduction • Files are split into blocks, usually 128 MB each. Source: HDFS Tutorial – A Complete Hadoop HDFS Overview. DataFlair Team.
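The block size is a per-file setting rather than a hard-wired constant. Below is a minimal sketch of how a client could request 128 MB blocks through the Hadoop Configuration API; the configuration key dfs.blocksize and the use of a default Configuration are assumptions made for this sketch, not something stated on the slide.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Ask for 128 MB blocks on files created by this client.
            // dfs.blocksize is the Hadoop 2.x key; older releases used dfs.block.size.
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);
            // On an HDFS FileSystem this reflects the dfs.blocksize setting above.
            System.out.println("Block size: " + fs.getDefaultBlockSize(new Path("/")));
            fs.close();
        }
    }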

  7. Architecture Source: Hadoop HDFS Architecture Explanation and Assumptions. DATAFLAIR TEAM.

  8. Architecture • The NameNode stores metadata, such as the number of data blocks, their replicas, and other details, in memory • It maintains and manages the DataNodes and assigns tasks to them

  9. Architecture • DataNodes store the application data. Source: Hadoop HDFS Architecture Explanation and Assumptions. DataFlair Team.

  10. Architecture HDFS Client: a code library that exports the HDFS file system interface
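As a concrete illustration of that client library, here is a minimal sketch that obtains a FileSystem handle for an HDFS URI and lists a directory, assuming the standard org.apache.hadoop.fs API; the NameNode address and the /user/demo path are placeholders, not values from the slides.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDirectory {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // hdfs://namenode.example.com:8020 is a placeholder cluster address.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }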

  11. Architecture • How does this architecture achieve high fault-tolerance? • DataNode failures • NameNode failure

  12. Architecture: Failure Recovery for DataNodes Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  13. Architecture: Failure Recovery for DataNodes • DataNodes send heartbeats and block reports to the NameNode. Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  14. Architecture: Failure Recovery for DataNodes Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  15. Architecture: Failure Recovery for DataNodes What if NameNode fails?

  16. Architecture: Failure Recovery for NameNode Image = Checkpoint + Journal • Image: The file system metadata that describes the organization of application data as directories and files. • Checkpoint: A persistent record of the image written to disk. • Journal: The modification log of the image. It is also stored in the local host’s native file system.

  17. Architecture: Failure Recovery for NameNode • CheckpointNode: • Periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. • BackupNode: • A read-only NameNode. • Maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.
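To make the checkpoint/journal cycle concrete, the sketch below models it with a toy in-memory namespace. None of these class, field, or method names come from the actual NameNode code; they only illustrate "replay the journal over the old checkpoint, persist the result as the new checkpoint, start an empty journal".

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy model: the image is a map of path -> metadata, the journal is a list of edits.
    public class ToyCheckpointer {
        Map<String, String> checkpoint = new HashMap<>(); // persistent image on disk
        List<String[]> journal = new ArrayList<>();       // modification log: {op, path, value}

        // Periodic checkpointing: replay the journal over the old checkpoint,
        // persist the result as the new checkpoint, and start an empty journal.
        void createCheckpoint() {
            Map<String, String> image = new HashMap<>(checkpoint);
            for (String[] edit : journal) {
                if (edit[0].equals("create")) image.put(edit[1], edit[2]);
                else if (edit[0].equals("delete")) image.remove(edit[1]);
            }
            checkpoint = image; // in HDFS this would be written to disk
            journal.clear();    // the journal becomes empty
        }

        public static void main(String[] args) {
            ToyCheckpointer cp = new ToyCheckpointer();
            cp.journal.add(new String[]{"create", "/a", "meta"});
            cp.createCheckpoint();
            System.out.println(cp.checkpoint); // {/a=meta}; the journal is now empty
        }
    }

Recovery after a NameNode restart follows the same replay logic: load the checkpoint, apply the journal, and the in-memory image is up to date.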

  18. Architecture: Failure Recovery for NameNode • Snapshots • To minimize potential damage to the data stored in the system during upgrades. • Persistently save the current state of the file system (both data and metadata).

  19. Architecture: Failure Recovery for NameNode • Snapshots (implemented with copy-on-write) • To minimize potential damage to the data stored in the system during upgrades. • Persistently save the current state of the file system (both data and metadata).

  20. Architecture: Failure Recovery for NameNode (diagram)
      • NameNode: keeps the image in memory; the checkpoint and journal are on disk.
      • BackupNode: keeps an in-memory image synchronized with the NameNode.
      • CheckpointNode: combines the checkpoint and journal, returning a new checkpoint and an empty journal.
      • DataNode (example): a snapshot on disk holds only hard links to the existing block files.

  21. Usage & Management of HDFS Cluster • Basic File I/O operations • Rack Awareness • Replication Management

  22. File I/O Operations • Write Files to HDFS: single writer, multiple readers • Diagram steps: (1) addBlock, (2) unique block IDs, (3) write to block. Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  23. File I/O Operations • Write Files to HDFS: single writer, multiple readers 1. The client consults the NameNode to get a lease and the destination DataNodes 2. The client writes a block to the DataNodes in a pipelined fashion 3. The DataNodes replicate the block 4. The client writes a new block after finishing the previous one • Note: the visibility of the modification is not guaranteed! Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
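A minimal client-side sketch of this write path, assuming the standard org.apache.hadoop.fs API: block allocation and the DataNode pipeline happen inside the library, and hflush() is the call that makes previously written data visible to readers (the visibility caveat above). The file path is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/output.txt");   // placeholder path
            try (FSDataOutputStream out = fs.create(file)) { // NameNode grants the lease,
                out.writeBytes("hello hdfs\n");              // DataNode pipeline writes the block
                out.hflush(); // without hflush()/close(), readers may not see this data yet
            }
            fs.close();
        }
    }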

  24. File I/O Operations • Read Files in HDFS 1. The client consults the NameNode to get the list of blocks and their replicas' locations 2. The client tries the nearest replica first, then the second nearest, and so on • Identifying corrupted data: checksums (CRC32)
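A matching read sketch, again assuming the standard org.apache.hadoop.fs API: it prints the block and replica locations the NameNode reports for the file and then streams its contents; verification of each block against its stored CRC32 checksum happens inside the client library. The path is the same placeholder as above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/output.txt"); // placeholder path

            // Block and replica locations for the whole file, as reported by the NameNode.
            FileStatus status = fs.getFileStatus(file);
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("block at offset " + block.getOffset()
                        + " on " + String.join(",", block.getHosts()));
            }

            // Stream the contents; the client verifies each block's checksum internally.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }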

  25. In-cluster Client Reads a File Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  26. Outside Client Reads a File Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  27. Rack Awareness Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  28. Rack Awareness • Benefits: • higher throughput • higher reliability: an entire rack failure never loses all replicas of a block • better network bandwidth utilization: reduce inter-rack and inter-node write traffic as much as possible

  29. Rack Awareness • The default HDFS replica placement policy: 1. No DataNode contains more than one replica of any block 2. No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster
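The sketch below is an illustrative (not Hadoop's actual) implementation of the default placement for a replication factor of three: the first replica on the writer's node, the second and third on two different nodes of a single other rack, which satisfies both rules above. The Node type and its rack field are invented for the example.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: Node and its rack field are invented for this sketch.
    class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    public class ToyReplicaPlacement {
        // Default policy for replication factor 3: 1st replica on the writer's node,
        // 2nd and 3rd on two different nodes of one other rack
        // (no node holds two replicas, no rack holds three).
        static List<Node> chooseTargets(Node writer, List<Node> cluster) {
            List<Node> targets = new ArrayList<>();
            targets.add(writer);
            for (Node n : cluster) {
                if (targets.size() == 3) break;
                boolean otherRack = !n.rack.equals(writer.rack);
                boolean unused = targets.stream().noneMatch(t -> t.name.equals(n.name));
                boolean sameRemoteRack = targets.size() == 1 || n.rack.equals(targets.get(1).rack);
                if (otherRack && unused && sameRemoteRack) targets.add(n);
            }
            return targets; // may be shorter than 3 if the cluster has too few racks or nodes
        }

        public static void main(String[] args) {
            Node writer = new Node("dn1", "rack1");
            List<Node> cluster = List.of(writer, new Node("dn2", "rack2"),
                                         new Node("dn3", "rack2"), new Node("dn4", "rack3"));
            for (Node n : chooseTargets(writer, cluster)) {
                System.out.println(n.name + " @ " + n.rack); // dn1@rack1, dn2@rack2, dn3@rack2
            }
        }
    }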

  30. Replication Management • to keep blocks from becoming under- or over-replicated Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
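A toy sketch of that bookkeeping, with invented names only: compare each block's live replica count (gathered from block reports) against its target and decide whether to schedule a re-replication or delete an excess replica.

    import java.util.List;

    // Illustrative only; the class and field names are invented for this sketch.
    public class ToyReplicationMonitor {
        static class BlockInfo {
            final String id;
            int liveReplicas;         // count derived from DataNode block reports
            final int targetReplicas; // usually 3
            BlockInfo(String id, int live, int target) {
                this.id = id; this.liveReplicas = live; this.targetReplicas = target;
            }
        }

        public static void main(String[] args) {
            List<BlockInfo> blocks = List.of(
                    new BlockInfo("blk_1", 2, 3),  // under-replicated: schedule a copy
                    new BlockInfo("blk_2", 4, 3),  // over-replicated: remove an excess replica
                    new BlockInfo("blk_3", 3, 3)); // healthy
            for (BlockInfo b : blocks) {
                if (b.liveReplicas < b.targetReplicas)
                    System.out.println(b.id + ": re-replicate " + (b.targetReplicas - b.liveReplicas));
                else if (b.liveReplicas > b.targetReplicas)
                    System.out.println(b.id + ": delete " + (b.liveReplicas - b.targetReplicas));
            }
        }
    }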

  31. Practice at Yahoo! Cluster Basic Information • Clusters at Yahoo! can be as large as ~3500 nodes, with a typical node configuration of: • 2 quad-core Xeon processors at 2.5 GHz • 4 directly attached SATA drives (1 TB each, 4 TB total) • 16 GB RAM • 1-Gbit Ethernet • In total, 9.8 PB of storage is available, of which 3.3 PB is available to user applications when blocks are replicated 3 times

  32. Practice at Yahoo! Data Durability • Uncorrelated node failures: • Chance of a node failing during a month: ~0.8% (a naive estimate of the failure probability over a year is then ~9.2%) • Chance of losing a block during a year: < 0.5% • Correlated node failures: • HDFS tolerates a rack switch failure • But a core switch failure or a cluster power loss can lose some blocks
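For reference, the ~9.2% yearly figure is presumably the naive compounding of the monthly failure rate under an independence assumption: 1 - (1 - 0.008)^12 ≈ 0.092.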

  33. Practice at Yahoo! • Benchmarks
      Table 1: Contrived benchmark compared with typical HDD performance
      Scenario                  | Read (MB/s per node)   | Write (MB/s per node)
      DFSIO                     | 66                     | 40
      7200 RPM Desktop HDD [6]  | < 130 (typical 50-120) | < 130 (typical 50-120)
      Table 2: HDFS performance in a production cluster
      Scenario      | Read (MB/s per node) | Write (MB/s per node)
      Busy Cluster  | 1.02                 | 1.09

  34. Practice at Yahoo! • Benchmarks
      Table 3: Sort benchmark
      Bytes (TB) | Nodes | Maps  | Reduces | Time (s) | HDFS I/O, Aggregate (GB/s) | HDFS I/O, Per Node (MB/s)
      1          | 1460  | 8000  | 2700    | 62       | 32                         | 22.1
      1000       | 3658  | 80000 | 20000   | 58500    | 34.2                       | 9.35
      Note: 1000 TB is too large to fit in node memory, so intermediate results spill to disk and occupy disk bandwidth.

  35. Practice at Yahoo! • Benchmarks
      Table 4: NameNode throughput benchmark
      Operation                 | Throughput (ops/s)
      Open file for read        | 126 100
      Create file               | 5 600
      Rename file               | 8 300
      Delete file               | 20 700
      DataNode heartbeat        | 300 000
      Blocks report (blocks/s)  | 639 700
      Note: operations that involve modifications (create, rename, delete) can be the bottleneck at large scale.

  36. Summary: HDFS: Two Easy Pieces* • Reliability • Throughput *: The title is from two great books: Six Easy Pieces: Essentials of Physics Explained by Its Most Brilliant Teacher, by Richard P. Feynman, and Operating Systems: Three Easy Pieces, by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau

  37. Summary: HDFS: Reliability • System Design: • Split files into blocks and replicate them (typically 3 replicas) • For the NameNode: • Checkpoint + Journal can restore the latest image • BackupNode • Snapshots • The NameNode is the single point of failure of the whole system - NOT GOOD! • For DataNodes: • Rack Awareness + the replica placement policy: never lose a block if a single rack fails • Replica Management: keeps blocks from being under-replicated • Snapshots

  38. Summary: HDFS: Throughput • System Design • Split files into large blocks (128 MB) - good for streaming and parallel access • Provide APIs that expose the locations of blocks - so applications can schedule computation to where the data resides • NameNode - not good for high throughput and scalability • A single node handles all requests from clients and manages all DataNodes • DataNodes • Rack awareness & the replica placement policy - better utilization of network bandwidth • Write files in a pipelined fashion • Read files from the nearest DataNode first

  39. Future Work (Out of Date!) • Automated failover solution • ZooKeeper • Scalability of the NameNode • Multiple namespaces sharing the physical storage • Advantages: • isolate namespaces • improve the overall availability • generalize the block storage abstraction • Drawbacks: • management cost

  40. Thank you.
