SLIDE 1

The Hadoop Distributed File System

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler. Yahoo!, Sunnyvale, California, USA

Presented by Haoran Ma, Yifan Qiao

SLIDE 2

Outline

  • Introduction
  • Architecture
  • File I/O Operations and Replica Management
  • Practice at YAHOO!
  • Future Work
SLIDE 3

Introduction

  • A single dataset is too large → divide it and store the pieces on a cluster of commodity hardware.
  • What if one of the physical machines fails?
  • Some applications, like MapReduce, need high-throughput data access.

SLIDE 4

Introduction

  • HDFS is the file system component of Hadoop. It is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications.
  • These goals are achieved by replicating file contents on multiple machines (DataNodes).
SLIDE 5

Introduction

  • Very Large Distributed File System
  • Assumes Commodity Hardware
  • Files are replicated to handle hardware failure
  • Optimized for Batch Processing
  • Data locations exposed so that computations can move to where data resides

SLIDE 6

Introduction

Files are split into blocks, usually 128 MB each

Source: HDFS Tutorial – A Complete Hadoop HDFS Overview. DATAFLAIR TEAM.
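The fixed-size-block idea can be sketched with a toy Python model (illustrative only; the function and names are ours, not HDFS code):

```python
# Toy model of how HDFS divides a file into fixed-size blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # default block size, 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file yields two full 128 MB blocks plus a final 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
```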

SLIDE 7

Architecture

Source: Hadoop HDFS Architecture Explanation and Assumptions. DATAFLAIR TEAM.

SLIDE 8

Architecture

  • The NameNode stores metadata such as the number of data blocks, replicas, and other details in memory
  • It maintains and manages the DataNodes, and assigns tasks to them

SLIDE 9

Architecture

DataNodes: store application data

Source: Hadoop HDFS Architecture Explanation and Assumptions. DATAFLAIR TEAM.

SLIDE 10

Architecture

HDFS Client: a code library that exports the HDFS file system interface

SLIDE 11

Architecture

  • How does this architecture achieve high fault-tolerance?
  • DataNode failures
  • NameNode failure
SLIDE 12

Architecture:

Failure Recovery for DataNodes

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

SLIDE 13

Architecture:

Failure Recovery for DataNodes

Block Report / Heartbeat

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

SLIDE 14

Architecture:

Failure Recovery for DataNodes

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

SLIDE 15

Architecture:

Failure Recovery for DataNodes

What if the NameNode fails?

SLIDE 16

Architecture:

Failure Recovery for NameNode

Image = Checkpoint + Journal

  • Image: the file system metadata that describes the organization of application data as directories and files.
  • Checkpoint: a persistent record of the image written to disk.
  • Journal: the modification log of the image, also stored in the local host's native file system.

SLIDE 17

Architecture:

Failure Recovery for NameNode

  • CheckpointNode:
  • Periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.
  • BackupNode:
  • A read-only NameNode.
  • Maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.
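The checkpoint-plus-journal recovery scheme can be modeled in a few lines of Python (a simplified sketch with a made-up journal format; real HDFS edit logs are binary and far richer):

```python
# Rebuild the in-memory image by replaying the journal over the checkpoint.
def restore_image(checkpoint, journal):
    image = dict(checkpoint)            # start from the persisted checkpoint
    for op, path, value in journal:     # replay logged modifications in order
        if op == "create":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
    return image

# What a CheckpointNode does periodically: fold the journal into a new
# checkpoint and start an empty journal.
def make_checkpoint(checkpoint, journal):
    return restore_image(checkpoint, journal), []

ckpt = {"/a": "file"}
log = [("create", "/b", "file"), ("delete", "/a", None)]
new_ckpt, new_log = make_checkpoint(ckpt, log)
```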

SLIDE 18

Architecture:

Failure Recovery for NameNode

  • Snapshots
  • Minimize potential damage to the data stored in the system during upgrades.
  • Persistently save the current state of the file system (both data and metadata).

SLIDE 19

Architecture:

Failure Recovery for NameNode

  • Snapshots
  • Minimize potential damage to the data stored in the system during upgrades.
  • Persistently save the current state of the file system (both data and metadata).

Copy on Write
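The copy-on-write idea behind snapshots can be illustrated with a tiny Python model (our own sketch; real HDFS snapshots hard-link block files rather than copying data):

```python
# A snapshot initially shares every entry with the live namespace, like
# hard links; entries diverge only when they are modified afterwards.
def take_snapshot(namespace):
    return dict(namespace)   # shallow copy: shared references, no data copied

fs = {"/a": "v1"}
snap = take_snapshot(fs)
fs["/a"] = "v2"              # modification after the snapshot
# snap still records "v1" while the live namespace sees "v2"
```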

SLIDE 20

Architecture:

Failure Recovery for NameNode

Diagram: the NameNode holds the image in memory and persists a checkpoint and journal on disk; the BackupNode keeps an in-memory image synchronized with the NameNode; the CheckpointNode combines the checkpoint and journal into a new checkpoint with an empty journal; a DataNode snapshot (example) contains only hard links.

SLIDE 21

Usage & Management of HDFS Cluster

  • Basic File I/O operations
  • Rack Awareness
  • Replication Management
SLIDE 22

File I/O Operations

  • Write Files to HDFS: single writer, multiple readers

(1) addBlock (2) unique block IDs (3) write to block

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

SLIDE 23
File I/O Operations

  • Write Files to HDFS: single writer, multiple readers

(1) addBlock (2) unique block IDs (3) write to block

  • 1. The client consults the NameNode to get a lease and destination DataNodes
  • 2. The client writes a block to the DataNodes in a pipelined fashion
  • 3. The DataNodes replicate blocks
  • 4. The client writes a new block after finishing the previous block

The visibility of the modification is not guaranteed!

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
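The pipelined write can be sketched as a toy Python model (illustrative only, not the real DataNode protocol; node names and the packet size are made up):

```python
# The client sends each packet once to the first DataNode, which forwards
# it down the chain, so every replica ends up with the full block.
def pipeline_write(block_data, datanodes, packet_size=4):
    stores = {dn: bytearray() for dn in datanodes}
    for i in range(0, len(block_data), packet_size):
        packet = block_data[i:i + packet_size]
        for dn in datanodes:   # each node appends the packet, then forwards it
            stores[dn] += packet
    return stores

replicas = pipeline_write(b"hello hdfs!", ["dn1", "dn2", "dn3"])
# all three replicas hold identical copies of the block
```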

SLIDE 24
File I/O Operations

  • Read Files in HDFS
  • 1. The client consults the NameNode to get the list of blocks and their replicas' locations
  • 2. It tries the nearest replica first, then the second nearest, and so on
  • Identifying corrupted data: CRC-32 checksums
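The CRC-32 integrity check can be demonstrated with Python's standard zlib module (a toy version; the chunk size and corruption scenario are our own):

```python
import zlib

# A reader recomputes per-chunk CRC-32 values and compares them with the
# stored checksums; any mismatch flags a corrupted chunk.
def checksum_chunks(data, chunk_size=512):
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

stored = checksum_chunks(b"A" * 1024)
corrupted = b"A" * 512 + b"B" * 512        # second chunk damaged
observed = checksum_chunks(corrupted)
bad_chunks = [i for i, (s, o) in enumerate(zip(stored, observed)) if s != o]
# the reader would fetch any flagged chunk from another replica
```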

SLIDE 25

In-cluster Client Reads a File

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

SLIDE 26

Outside Client Reads a File

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

SLIDE 27

Rack Awareness

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

SLIDE 28

Rack Awareness

  • Benefits:
  • Higher throughput
  • Higher reliability: an entire rack failure never loses all replicas of a block
  • Better network bandwidth utilization: reduce inter-rack and inter-node write traffic as much as possible

SLIDE 29
Rack Awareness

  • The default HDFS replica placement policy:
  • 1. No DataNode contains more than one replica of any block
  • 2. No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster
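The two placement rules can be checked mechanically; here is a toy Python validator (node and rack names are invented for illustration):

```python
from collections import Counter

# Rule 1: at most one replica of a block per DataNode.
# Rule 2: at most two replicas of a block per rack.
def valid_placement(replicas):        # replicas: list of (node, rack) pairs
    node_counts = Counter(node for node, _ in replicas)
    rack_counts = Counter(rack for _, rack in replicas)
    return (max(node_counts.values()) <= 1 and
            max(rack_counts.values()) <= 2)

ok = valid_placement([("n1", "r1"), ("n2", "r1"), ("n3", "r2")])   # spans racks
bad = valid_placement([("n1", "r1"), ("n2", "r1"), ("n3", "r1")])  # one rack
```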

SLIDE 30

Replication Management

  • To avoid blocks becoming under- or over-replicated

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
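A sketch of the bookkeeping this requires (simplified; the block names and the target factor of 3 are illustrative):

```python
# Compare each block's actual replica count with the target replication
# factor to find under- and over-replicated blocks.
def classify_blocks(replica_counts, target=3):
    under = [b for b, n in replica_counts.items() if n < target]
    over = [b for b, n in replica_counts.items() if n > target]
    return under, over

under, over = classify_blocks({"blk_1": 2, "blk_2": 3, "blk_3": 4})
# blk_1 needs another replica; one replica of blk_3 can be removed
```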

SLIDE 31

Practice at Yahoo!

Cluster Basic Information

  • Clusters at Yahoo! can be as large as ~3500 nodes, with a typical node configuration of:
  • 2 quad-core Xeon processors @ 2.5 GHz
  • 4 directly attached SATA drives (1 TB each, 4 TB total)
  • 16 GB RAM
  • 1-Gbit Ethernet
  • In total, 9.8 PB of storage is available, of which 3.3 PB is available for user applications when replicating blocks 3 times

SLIDE 32

Practice at Yahoo!

Data Durability

  • Uncorrelated node failures:
  • Chance of a node failing during a month: ~0.8% (a naive estimate of the failure probability for a node during a year is ~9.2%)
  • Chance of losing a block during a year: < 0.5%
  • Correlated node failures:
  • HDFS tolerates a rack switch failure
  • But a core switch failure or cluster power loss can lose some blocks
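The naive yearly estimate follows directly from the monthly figure; a quick check in Python:

```python
# If a node fails in a given month with probability ~0.8%, the chance it
# fails at least once in a year (12 independent months) is
# 1 - (1 - 0.008)**12, roughly 9.2% -- the figure quoted above.
p_month = 0.008
p_year = 1 - (1 - p_month) ** 12
```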

SLIDE 33

Practice at Yahoo!

  • Benchmarks

  Scenario                   Read (MB/s per node)     Write (MB/s per node)
  DFSIO                      66                       40
  7200 RPM desktop HDD [6]   < 130 (typical 50-120)   < 130 (typical 50-120)

Table 1: Contrived benchmark compared with typical HDD performance

  Scenario       Read (MB/s per node)   Write (MB/s per node)
  Busy Cluster   1.02                   1.09

Table 2: HDFS performance in a production cluster

SLIDE 34

Practice at Yahoo!

  • Benchmarks

  Bytes (TB)   Nodes   Maps    Reduces   Time (s)   HDFS I/O Aggregate (GB/s)   HDFS I/O Per Node (MB/s)
  1            1460    8000    2700      62         32                          22.1
  1000         3658    80000   20000     58500      34.2                        9.35

Table 3: Sort benchmark

1000 TB is too large to fit in node memory, so intermediate results spill to disk and occupy disk bandwidth.

SLIDE 35

Practice at Yahoo!

  • Benchmarks

  Operation                  Throughput (ops/s)
  Open file for read         126 100
  Create file                5 600
  Rename file                8 300
  Delete file                20 700
  DataNode heartbeat         300 000
  Blocks report (blocks/s)   639 700

Table 4: NameNode throughput benchmark

Operations that modify the namespace (create, rename, delete) can be the bottleneck at large scale.

SLIDE 36

Summary: HDFS: Two Easy Pieces*

  • Reliability
  • Throughput

*: The title is from two great books: Six Easy Pieces: Essentials of Physics Explained by Its Most Brilliant Teacher, by Richard P. Feynman, and Operating Systems: Three Easy Pieces, by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau

SLIDE 37

Summary: HDFS: Reliability

  • System Design:
  • Split files into blocks and replicate them (typically 3 replicas)
  • For the NameNode:
  • Checkpoint + Journal can restore the latest image
  • BackupNode
  • Snapshots
  • The NameNode is the single point of failure of the whole system - NOT GOOD!
  • For DataNodes:
  • Rack awareness + replica placement policy: never lose a block if a rack fails
  • Replica management, to avoid blocks becoming under-replicated
  • Snapshots
SLIDE 38

Summary: HDFS: Throughput

  • System Design
  • Split files into large blocks (128 MB) - good for streaming and parallel access
  • Provide APIs that expose the locations of blocks - facilitating applications to schedule computation tasks where the data reside
  • NameNode - not good for high throughput and scalability
  • A single node handles all requests from clients and manages all DataNodes
  • DataNodes
  • Rack awareness & replica placement policy - better utilization of network bandwidth
  • Write files in a pipelined way
  • Read files from the nearest DataNode first
SLIDE 39

Future Work (Out of Date!)

  • Automated failover solution
  • ZooKeeper
  • Scalability of the NameNode
  • Multiple namespaces sharing the physical storage
  • Advantages:
  • Isolate namespaces
  • Improve the overall availability
  • Generalize the block storage abstraction
  • Drawbacks:
  • Management cost
SLIDE 40

Thank you.

SLIDE 41
  • References:

[1] "The Hadoop Distributed File System". Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler.
[2] "Hadoop HDFS Architecture Explanation and Assumptions". DATAFLAIR TEAM. https://data-flair.training/blogs/hadoop-hdfs-architecture/
[3] "HDFS Tutorial – A Complete Hadoop HDFS Overview". DATAFLAIR TEAM. https://data-flair.training/blogs/hadoop-hdfs-tutorial/
[4] "HDFS Architecture". http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
[5] "Understanding Hadoop Clusters and the Network". Brad Hedlund. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
[6] "Speed Considerations". Seagate. https://web.archive.org/web/20110920075313/http://www.seagate.com/www/en-us/support/before_you_buy/speed_considerations