HDFS Under the Hood

Sanjay Radia

Sradia@yahoo-inc.com
Grid Computing, Hadoop
Yahoo Inc.



Outline

  • Overview of Hadoop, an open source project
  • Design of HDFS
  • Ongoing work

Hadoop

  • Hadoop provides a framework for storing and processing petabytes of data using commodity hardware and storage
  • Storage: HDFS, HBase
  • Processing: MapReduce

(Diagram: the Hadoop stack, with Pig over MapReduce and HBase over HDFS)


Hadoop, an Open Source Project

  • Implemented in Java
  • Apache Top Level Project

– http://hadoop.apache.org/core/
– Core (15 Committers)

  • HDFS
  • MapReduce

– HBase (3 Committers)

  • Community of contributors is growing

– Though mostly Yahoo for HDFS and MapReduce
– Powerset is leading the effort for HBase
– Facebook is in the process of open-sourcing Hive
– Opportunities to contribute to a major open source project


Hadoop Users

  • Clusters from 1 to 2k nodes

– Yahoo, Last.fm, Joost, Facebook, A9, …
– In production use at Yahoo in multiple 2k-node clusters
– Initial tests show that 0.18 will scale to 4K nodes; being validated

  • Broad academic interest

– IBM/Google cloud computing initiative

  • A 40-node cluster + Xen-based VMs …

– CMU/Yahoo supercomputer cluster

  • M45 - a 500-node cluster
  • Looking into making this more widely available to other universities
  • Hadoop Summit hosted by Yahoo! in March 2008

– Attracted over 400 attendees


Hadoop Characteristics

  • Commodity HW + Horizontal scaling

– Add inexpensive servers with JBODs
– Storage servers and their disks are not assumed to be highly reliable and available

  • Use replication across servers to deal with unreliable storage/servers
  • Metadata-data separation - simple design

– Storage scales horizontally
– Metadata scales vertically (today)

  • Slightly restricted file semantics

– Focus is mostly sequential access
– Single writers
– No file-locking features

  • Support for moving computation close to data

– i.e. servers have 2 purposes: data storage and computation

Simplicity of design is why a small team could build such a large system in the first place


File Systems Background (1): Scaling

(Diagram: file systems classified by scaling approach: vertical scaling; distributed FSs that horizontally scale I/O and storage; designs that horizontally scale namespace ops and I/O; designs that vertically scale namespace ops.)


File Systems Background (2): Federating

  • Andrew

– Federated mount of file systems on /afs
– (Plus follow-on work on disconnected operations)

  • Newcastle Connection

– /.. mounts

  • (plus remote Unix semantics)
  • Many others ….

File Systems Background (3)

  • Separation of metadata from data - 1978, 1980

– “Separating Data from Function in a Distributed File System” (1978)

– by J E Israel, J G Mitchell, H E Sturgis

– “A universal file server” (1980) by A D Birrell, R M Needham

  • + Horizontal scaling of storage nodes and I/O bandwidth

– Several startups building scalable NFS
– Lustre
– GFS
– pNFS

  • + Commodity HW with JBODs, replication, non-POSIX semantics

– GFS

  • + Computation close to the data

– GFS/MapReduce


Hadoop: Multiple FS implementations

FileSystem is the interface for accessing the file system. It has multiple implementations:

  • HDFS: “hdfs://”
  • Local file system: “file://”
  • Amazon S3: “s3://”

– See Tom White’s write-up on using Hadoop on EC2/S3

  • Kosmos “kfs://”
  • Archive “har://” - 0.18
  • You can set your default file system

– so that your file names are simply /foo/bar/…

  • MapReduce uses the FileSystem interface - hence it can run on multiple file systems
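To make the pluggable interface concrete, here is a minimal Java sketch; the namenode host, port, and paths are placeholders, and fs.default.name is the era-appropriate configuration key for the default file system.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsSchemes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Set the default file system so callers can use bare paths like
    // /foo/bar; "namenode.example.com" is a placeholder host.
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020/");

    // Same FileSystem interface, different implementations selected by
    // the URI scheme.
    FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode.example.com:8020/"), conf);
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);
    System.out.println(hdfs.getUri() + " vs " + local.getUri());

    // A schemeless path resolves against the configured default (HDFS here).
    FileSystem dflt = FileSystem.get(conf);
    System.out.println(dflt.exists(new Path("/foo/bar")));
  }
}
```

Because MapReduce is written against this interface, job code is unchanged whether the input lives in HDFS, the local file system, or S3.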


HDFS: Directories, Files & Blocks

  • Data is organized into files and directories
  • Files are divided into uniform-sized blocks and distributed across cluster nodes
  • Blocks are replicated to handle hardware failure
  • HDFS keeps checksums of data for corruption detection and recovery
  • HDFS (& FileSystem) exposes block placement so that computation can be migrated to data
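The block-placement exposure in the last bullet is visible to client code through FileSystem#getFileBlockLocations. A minimal sketch, assuming a hypothetical input file:

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacement {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus stat = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical file

    // One BlockLocation per block in the requested byte range; each lists
    // the datanodes holding a replica, which is what lets a scheduler run
    // the computation on (or near) a host that already has the data.
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      System.out.println(b.getOffset() + "+" + b.getLength()
          + " on " + Arrays.toString(b.getHosts()));
    }
  }
}
```

This is the same information MapReduce consults when placing map tasks close to their input blocks.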


HDFS Architecture (1)

  • Files broken into blocks of 128MB (per-file configurable; see the sketch after this list)
  • Single Namenode

– Manages the file namespace
– Maps each file name to its list of blocks + block locations
– File metadata (i.e. “inode”)
– Authorization and authentication
– Collects block reports from Datanodes on block locations
– Replicates missing blocks
– Implementation detail:

  • Keeps ALL namespace in memory, plus checkpoints & journal

– 60M objects on a 16GB machine (e.g. 20M files with 2 blocks each: 20M file inodes + 40M blocks)

  • Datanodes (thousands) handle block storage

– Clients access the blocks directly from Datanodes
– Datanodes periodically send block reports to the Namenode
– Implementation detail:

  • Datanodes store the blocks using the underlying OS’s files
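Since block size is a per-file property fixed at create time, it can be chosen programmatically; a minimal sketch with a hypothetical path, using the FileSystem.create overload that takes replication and block size:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Write a hypothetical file with 128MB blocks and 3 replicas; files
    // created without these arguments use the configured defaults.
    FSDataOutputStream out = fs.create(
        new Path("/data/big.log"),  // hypothetical path
        true,                       // overwrite if it exists
        4096,                       // client-side buffer size in bytes
        (short) 3,                  // replication factor
        128L * 1024 * 1024);        // block size in bytes
    out.writeBytes("hello hdfs\n");
    out.close();
  }
}
```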

HDFS Architecture (2)

(Diagram: clients call getFileInfo, getLocations, create, and addBlock on the Namenode, then read and write block replicas (b1..b6) directly on the Datanodes; Datanodes report blockReceived, and the Namenode, which persists metadata plus a log, directs replicate/copy operations.)


HDFS Architecture (3): Computation close to the data

(Diagram: a Hadoop cluster where MAP tasks run on the nodes holding DFS Blocks 1-3 of the input data and a Reduce task combines their output into results, illustrating computation moving to the data rather than data to the computation.)


Reads, Writes, Block Placement and Replication

Reads

  • From the nearest replica

Writes

  • Writes are pipelined to block replicas
  • Append is mostly in 0.18, will be completed in 0.19

– Hardest part is dealing with failures of DNs holding the replicas during an append

  • Generation number to deal with failures of DNs during write

– Concurrent appends are likely to happen in the future

  • No plans to add random writes so far.

Replication and block placement

  • A file’s replication factor can be changed dynamically (default 3); see the sketch after this list
  • Block placement is rack aware
  • Block under-replication & over-replication is detected by Namenode

– triggers a copy or delete operation

  • Balancer application rebalances blocks to balance DN utilization
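Changing a file’s replication factor, as noted in the list above, is an online operation: the client updates the file’s metadata and the Namenode then schedules copies or deletes. A minimal sketch with a hypothetical path (the shell equivalent is hadoop fs -setrep):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Raise the replication factor of an existing file from the default
    // (3) to 5; the Namenode detects the under-replication and schedules
    // the extra copies in the background.
    boolean accepted = fs.setReplication(new Path("/data/big.log"), (short) 5);
    System.out.println("replication change accepted: " + accepted);
  }
}
```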

Details (1): The Protocols

  • Client to Namenode

– RPC

  • Client to Datanode

– Streaming writes/reads

  • On reads, data is shipped directly from the OS

– Considering RPC for pread(offset, bytes); a client-side pread sketch follows this list

  • Datanode to Namenode

– RPC (heartbeat, blockReport …)
– RPC reply is the command from Namenode to Datanode (copy block …)

  • RPC

– Not cross-language, but that was the goal (hence not RMI …)
– 0.18 RPC is quite good

  • solves timeouts and manages queues/buffers and spikes well
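The pread being considered for RPC already exists on the client API as a positioned read on FSDataInputStream; today it rides the streaming protocol. A minimal client-side sketch, assuming a hypothetical file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/data/big.log")); // hypothetical file

    // Positioned read (pread): fetch bytes at a given offset without
    // moving the stream's current read position.
    byte[] buf = new byte[4096];
    int n = in.read(1024L * 1024, buf, 0, buf.length);
    System.out.println("read " + n + " bytes at offset 1MB");
    in.close();
  }
}
```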

Current & near-term work at Yahoo! (1)

  • HDFS

– Improved RPC - 0.18 - made a significant improvement in scaling
– Clean up the interfaces - 0.18, 0.19
– Improved Reliability & Availability - 0.17, 0.18, 0.19

  • Current Uptime: 98.5% (includes planned downtime)
  • Data loss: a few data blocks lost due to bugs and corruption; never had any fsimage corruption

– Append - 0.18, 0.19
– Authorization (POSIX-like permissions) - 0.16
– Authentication - in progress, 0.20?
– Performance - 0.17, 0.18, 0.19, …

  • 0.16 Performance

– Reads within rack: 4 threads: 110MB/s, 1 thread: 40MB/s (buffer copies fixed in 0.17)
– Writes within rack: 4 threads: 45MB/s, 1 thread: 21MB/s

  • Goal: Read/write at speed of network/disk with low CPU utilization

– NN scaling
– NN replication and HA
– Protocol/interface versioning
– Language-agnostic RPC


Current & near-term work at Yahoo! (2)

  • MapReduce

– New API using context objects - 0.19?
– New resource manager/scheduler framework

  • Main goal - release resources not being used (e.g. during reduce phase)
  • Pluggable schedulers

– Yahoo - Queues with guaranteed capacity + priorities + user quotas
– Facebook - Queue per user?


Scaling the Name Service: Options

(Chart, not to scale: name-service scaling options plotted by number of names (20M, 60M, 400M, 1000M, 2000M+) against number of clients (1x, 4x, 20x, 50x, 100x). Options shown: all NS in memory; NS in memory plus RO replicas; NS in malloc heap + RO replicas + finer-grain locks; separate block maps from the NN; partial NS (cache) in memory; partial NS in memory with mountable volumes; mountable volumes + automount volume catalog, which has good isolation properties; dynamic partitioning; archives.)


Scaling the Name Service: Mountable Namespace Volumes with “Automounter”


Scaling using Volumes

Multiple namespace volumes that share the Datanodes

  • Volume = Namespace + Blockpool

– Volume is a self-contained unit

  • Can GC independently of other volumes

– A grid has multiple volumes and hence multiple blockpools

  • One Storage Pool

– Each DN can contribute to each blockpool
– A block-pool is like a LUN
– The block-pool manager deals with replication & BRs

  • Each namespace registered in the catalog
  • Automount at top level using ZooKeeper

– BTW we will support mounting elsewhere

  • Block-pools and volumes useful for other things

– Snapshots, temp space, per-job tmp namespace, quotas, federation, …

  • Other (non-HDFS) namespaces built on block-pools

(Diagram: Grid 1 and Grid 2, each with multiple namespaces (NS) over their own block pools; the Datanodes (DN) host blocks for all pools; a volume V = NS + BP.)


DN, BP and NS Interactions

(Diagram: a client calls GetBlist(pathname) on a namespace (NS1, NS2) and GetLocations(B) on the corresponding block-pool manager; each block-pool manager receives heartbeats & block reports (HB & BRs) from the Datanodes for its pool, calls getListOfValidBlocks(), and issues free/replicate commands for blocks in its pool (BP1, BP2).)


One example of managing namespaces

(Diagram: a namespace rooted at / with users (d1, d2, d3), projects (p1, p2), data, and tmp subtrees mounted from volumes on Grid1.corp and ysearch.corp; analogous to DNS domains such as Yahoo-inc.com and Cmu.org.)


Namespace Volumes

  • Plus

– Scales both namespace and performance
– Gives performance isolation between volumes
– Well-understood concepts

  • POSIX mount, DNS “mount” and automounter
  • Block-pools are logical storage units (LUNs)

– A volume (NS + block-pool) is a self-contained unit with no dependencies on other volumes
– Block-pools and volumes are useful for other things

  • Snapshots, temp space, per-job tmp namespace, quotas, federation, …
  • Other (non-HDFS) namespaces built on block-pools

– Compatible with separation of block maps (true for most other solutions)
– Compatible with RO replicas (true for most other solutions)

  • Minus

– Manual separation of NSs (a few can be done automatically)
– Not completely transparent (namespace structure changes unless you manually mount in the right place)

  • Common issue for all NS partitioning solutions

– Managing multiple NNs


More Info

  • Main Web sites

– http://hadoop.apache.org/core/
– http://wiki.apache.org/hadoop/
– http://wiki.apache.org/hadoop/GettingStartedWithHadoop
– http://wiki.apache.org/hadoop/HadoopMapReduce

  • Hadoop Futures - categorized list of areas of new development

– http://wiki.apache.org/hadoop/HdfsFutures

  • Some ideas for interesting projects on Hadoop

– http://research.yahoo.com/node/1980

  • My contact

– Sradia@yahoo-inc.com