1 Yahoo!
HDFS Under the Hood
Sanjay Radia
Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc.
Outline
– Overview of Hadoop, an open source project
– Design of HDFS
– Ongoing work
HDFS MapReduce Pig HBase
– http://hadoop.apache.org/core/
– Core (15 committers)
– HBase (3 committers)
– Though mostly Yahoo for HDFS and MapReduce
– Powerset is leading the effort for HBase
– Facebook is in the process of opening Hive
– Opportunities to contribute to a major open source project
– Yahoo, Last.fm, Joost, Facebook, A9, …
– In production use at Yahoo in multiple 2K-node clusters
– Initial tests show that 0.18 will scale to 4K nodes (being validated)
– IBM/Google cloud computing initiative
– CMU/Yahoo supercomputer cluster
– Attracted over 400 attendees
– Add inexpensive servers with JBODs
– Storage servers and their disks are not assumed to be highly reliable and available
– Storage scales horizontally
– Metadata scales vertically (today)
– Focus is mostly sequential access
– Single writers
– No file locking features
– i.e. servers have 2 purposes: data storage and computation
Simplicity of design is why a small team could build such a large system in the first place
[Figure: scaling approaches for a distributed FS. Labels: vertical scaling; horizontally scale IO and storage; horizontally scale namespace ops and IO; vertically scale namespace ops]
– Federated mount of file systems on /afs
– (Plus follow-on work on disconnected operations)
– /.. Mounts
– “Separating Data from Function in a Distributed File System” (1978) by J E Israel, J G Mitchell, H E Sturgis
– “A universal file server” (1980) by A D Birrell, R M Needham
– Several startups building scalable NFS
– Lustre
– GFS
– pNFS
– GFS
– GFS/MapReduce
FileSystem is the interface for accessing the file system. It has multiple implementations:
– See Tom White’s writeup on using Hadoop on EC2/S3
– so that your file names are simply /foo/bar/…
systems
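The idea of one FileSystem interface with several implementations, and of a default file system that lets file names be written simply as /foo/bar/…, can be sketched as follows. This is a minimal Python illustration, not Hadoop's Java API: the classes `FileSystem` and `InMemoryFS`, the registry, the default scheme, and the `resolve` helper are all hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple
from urllib.parse import urlparse


class FileSystem(ABC):
    """Hypothetical stand-in for a file-system interface (not Hadoop's Java API)."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryFS(FileSystem):
    """Toy implementation that keeps file contents in a dict."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._files[path]

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data


# Registry from URI scheme to implementation; both entries are toy stores here.
REGISTRY: Dict[str, FileSystem] = {"hdfs": InMemoryFS(), "s3": InMemoryFS()}
DEFAULT_SCHEME = "hdfs"  # assumed default, so bare paths such as /foo/bar work


def resolve(uri: str) -> Tuple[FileSystem, str]:
    """Return (file system, path) for a full URI or a bare path."""
    parsed = urlparse(uri)
    scheme = parsed.scheme or DEFAULT_SCHEME
    return REGISTRY[scheme], parsed.path


fs, path = resolve("/foo/bar")
fs.write(path, b"hello")
same_fs, same_path = resolve("hdfs://namenode/foo/bar")
```

With a default scheme configured, the bare path and the fully qualified URI resolve to the same file system and the same path, so client code does not need to know which implementation is behind the name.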
cluster nodes
recovery
can be migrated to data
– Manages the file namespace
– File name to list of blocks + location mapping
– File metadata (i.e. “inode”)
– Authorization and authentication
– Collects block reports from Datanodes on block locations
– Replicates missing blocks
– Implementation detail:
– 60M objects on 16G machine (e.g. 20M files with 2 blocks each)
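The object count in the example above can be checked with a few lines of arithmetic. The sketch assumes that "16G" means 16 GiB and that the whole heap is available for metadata, so the per-object figure is an upper bound rather than a measured value.

```python
files = 20_000_000
blocks_per_file = 2

# One namespace object per file plus one per block.
objects = files + files * blocks_per_file

# Assumption: "16G" is 16 GiB and all of it holds metadata.
heap_bytes = 16 * 2**30
bytes_per_object = heap_bytes / objects

print(f"{objects:,} objects, at most {bytes_per_object:.0f} bytes each")
```

Under these assumptions the budget comes out at under 300 bytes per file or block object.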
– Clients access the blocks directly from Datanodes
– Datanodes periodically send block reports to the Namenode
– Implementation detail:
[Figure: HDFS architecture. Components: two Clients, the Namenode (Metadata, Metadata Log), and Datanodes holding replicas of blocks b1 to b6. Operations shown: create, addBlock, getLocations, getFileInfo, read, write, blockReceived, replicate, copy]
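The read path in the architecture can be sketched in a few lines: the client asks the Namenode only for block locations and then fetches the bytes directly from Datanodes. This is a toy in-memory model, not the HDFS implementation. The method name `get_locations` mirrors the getLocations label in the figure; the classes and the `read_file` helper are hypothetical.

```python
class Datanode:
    """Toy Datanode: stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class Namenode:
    """Toy Namenode: holds only metadata, never file contents."""

    def __init__(self):
        self.files = {}      # file name -> ordered list of block ids
        self.locations = {}  # block id -> list of Datanodes holding a replica

    def get_locations(self, filename):
        return [(b, self.locations[b]) for b in self.files[filename]]


def read_file(namenode, filename):
    """Client-side read: metadata from the Namenode, data from Datanodes."""
    chunks = []
    for block_id, replicas in namenode.get_locations(filename):
        # This sketch always uses the first replica; a real client would
        # choose among the replicas and fail over on errors.
        chunks.append(replicas[0].blocks[block_id])
    return b"".join(chunks)


dn1, dn2 = Datanode("dn1"), Datanode("dn2")
dn1.blocks["b1"] = b"hello "
dn2.blocks["b1"] = b"hello "
dn2.blocks["b2"] = b"world"

nn = Namenode()
nn.files["/foo/bar"] = ["b1", "b2"]
nn.locations = {"b1": [dn1, dn2], "b2": [dn2]}

content = read_file(nn, "/foo/bar")
```

Keeping file data off the Namenode is what lets storage and IO scale horizontally while the metadata stays on one server.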
[Figure: MapReduce on a Hadoop cluster. Input data is stored as replicated DFS blocks (DFS Block 1, 2, 3, three copies each); MAP tasks and a Reduce task process the blocks and produce the results]
Reads
Writes
– Hardest part is dealing with failures of DNs holding the replicas during an append
– Concurrent appends are likely to happen in the future
Replication and block placement
– triggers a copy or delete operation
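The copy-or-delete decision can be sketched as a comparison between the number of replicas reported for each block and a target replication factor. This is a simplified illustration: the function name is hypothetical, the target of 3 is an assumed value, and placement (which Datanode receives or drops a replica) is left out entirely.

```python
def replication_actions(replica_counts, target=3):
    """Map each mis-replicated block to ("copy" | "delete", how many).

    replica_counts: block id -> number of replicas currently reported.
    target: desired replica count (assumed value for this sketch).
    """
    actions = {}
    for block, count in replica_counts.items():
        if count < target:
            actions[block] = ("copy", target - count)
        elif count > target:
            actions[block] = ("delete", count - target)
    return actions


actions = replication_actions({"b1": 2, "b2": 3, "b3": 4})
```

Blocks that already match the target produce no action, so the result lists only the work to be done.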
– RPC
– Streaming writes/reads
– Considering RPC for pread(offset, bytes)
– RPC (heartbeat, blockReport, …)
– The RPC reply carries the command from the Namenode to the Datanode (copy block, …)
– Not cross-language, but that was the goal (hence not RMI, …)
– 0.18 RPC is quite good
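The pattern of using the heartbeat reply to carry commands can be sketched as follows: the Namenode never calls a Datanode; it queues work and hands it over when the Datanode next checks in. The class and method names are hypothetical, and the command tuples are an invented format for illustration.

```python
from collections import defaultdict, deque


class CommandQueueNamenode:
    """Toy Namenode that answers heartbeats with queued commands."""

    def __init__(self):
        self.pending = defaultdict(deque)  # datanode id -> queued commands
        self.last_heartbeat = {}           # datanode id -> time last seen

    def schedule(self, datanode_id, command):
        """Queue a command; it is delivered on the next heartbeat."""
        self.pending[datanode_id].append(command)

    def heartbeat(self, datanode_id, now):
        """Record liveness and return (and clear) the queued commands."""
        self.last_heartbeat[datanode_id] = now
        commands = list(self.pending[datanode_id])
        self.pending[datanode_id].clear()
        return commands


nn = CommandQueueNamenode()
nn.schedule("dn1", ("copy", "b3", "dn2"))  # hypothetical command format
first_reply = nn.heartbeat("dn1", now=0.0)
second_reply = nn.heartbeat("dn1", now=3.0)
```

Because every exchange is initiated by the Datanode, the Namenode needs no outbound connections and a missing heartbeat doubles as a failure signal.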
– Improved RPC - 0.18 - made a significant improvement in scaling
– Clean up the interfaces - 0.18, 0.19
– Improved reliability & availability - 0.17, 0.18, 0.19
– Append - 0.18, 0.19
– Authorization (POSIX-like permissions) - 0.16
– Authentication - in progress, 0.20?
– Performance - 0.17, 0.18, 0.19, …
– Reads within rack: 4 threads 110MB/s, 1 thread 40MB/s (buffer copies fixed in 0.17)
– Writes within rack: 4 threads 45MB/s, 1 thread 21MB/s
– NN scaling
– NN replication and HA
– Protocol/interface versioning
– Language-agnostic RPC
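The within-rack read and write throughput figures listed above can be turned into a speedup and a per-thread efficiency. The sketch only restates the quoted numbers; the `scaling` helper is a hypothetical name.

```python
# Aggregate throughput in MB/s, keyed by number of threads (figures from the list above).
reads = {1: 40.0, 4: 110.0}
writes = {1: 21.0, 4: 45.0}


def scaling(measurements, threads=4):
    """Return (speedup over one thread, efficiency per thread)."""
    speedup = measurements[threads] / measurements[1]
    return speedup, speedup / threads


read_speedup, read_efficiency = scaling(reads)
write_speedup, write_efficiency = scaling(writes)
```

Reads scale better with thread count than writes on these numbers, which fits the listing of performance as ongoing work.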
– New API using context objects - 0.19?
– New resource manager/scheduler framework
– Yahoo - Queues with guaranteed capacity + priorities + user quotas
– Facebook - Queue per user?
[Figure, not to scale: namespace scaling options plotted by # names (20M, 60M, 400M, 1000M, 2000M+) and # clients (1x, 4x, 20x, 50x, 100x). Labels: all NS in memory; NS in memory plus RO replicas; NS in malloc heap + RO replicas + finer-grain locks; separate Bmaps from NN; partial NS (cache) in memory; partial NS in memory with mountable volumes; mountable volumes + automount volume catalog; dynamic partitioning; archives; good isolation properties]
Multiple name space volumes that share the data node
– Volume is a self-contained unit
– A grid has multiple volumes and hence multiple blockpools
– Each DN can contribute to each block pool
– A block pool is like a LUN
– Block pool manager deals with replication & block reports (BRs)
– BTW we will support mounting elsewhere
– Snapshots, temp space, per-job tmp namespace,
– quotas, federation
[Figure: Grid 1 and Grid 2, showing Datanodes (DN), block pools, and namespaces (NS). A volume is a namespace plus a block pool: V = NS + BP]
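The relationship V = NS + BP, with several block pools sharing the same Datanodes, can be sketched as a small data model. Everything here is illustrative: the class names, the per-pool block report, and the pool identifiers are assumptions, not the proposed HDFS design.

```python
class Datanode:
    """Toy Datanode that stores blocks for several block pools."""

    def __init__(self, name):
        self.name = name
        self.pools = {}  # block pool id -> set of block ids

    def store(self, pool_id, block_id):
        self.pools.setdefault(pool_id, set()).add(block_id)

    def block_report(self, pool_id):
        """Report only the blocks belonging to one pool."""
        return sorted(self.pools.get(pool_id, set()))


class Volume:
    """A volume is a namespace plus its own block pool (V = NS + BP)."""

    def __init__(self, name, pool_id):
        self.name = name
        self.pool_id = pool_id
        self.namespace = {}  # path -> list of block ids

    def add_file(self, path, block_ids, datanode):
        self.namespace[path] = list(block_ids)
        for block_id in block_ids:
            datanode.store(self.pool_id, block_id)


dn = Datanode("dn1")
users = Volume("users", pool_id="BP1")
projects = Volume("projects", pool_id="BP2")
users.add_file("/alice/log", ["b1", "b2"], dn)
projects.add_file("/p1/data", ["b1"], dn)  # same block id, different pool

report_bp1 = dn.block_report("BP1")
report_bp2 = dn.block_report("BP2")
```

Because block ids are scoped by pool, each volume can be managed without coordinating with the others, even though they share the same Datanode.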
[Figure: a Client and Datanodes (DN) interacting with two volumes. Labels: NS1 with Block Pool 1 Manager; NS2 with Block Pool 2 Manager; HB & BRs for BP1 and for BP2; Free, Replicate B in BP1 and in BP2; GetBlist(pathname); GetLocations(B); getListOfValidBlocks()]
[Figure: mounted namespace tree. Labels: Grid1.corp, ysearch.corp, root /, directories users, projects, data, tmp, entries d1, d2, d3, p1, p2, and /.. mounts to Yahoo-inc.com and Cmu.org]
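Mounting volumes into one namespace tree amounts to a client-side lookup from path prefix to volume. The sketch below uses longest-prefix matching; the mount table contents, the volume names, and the `resolve` helper are hypothetical and only loosely follow the directory names in the figure.

```python
def resolve(mount_table, path):
    """Return (volume, path inside that volume) using the longest matching mount."""
    matches = [m for m in mount_table if path == m or path.startswith(m + "/")]
    if not matches:
        raise KeyError(f"no volume mounted for {path}")
    mount = max(matches, key=len)
    rest = path[len(mount):] or "/"
    return mount_table[mount], rest


# Hypothetical mount table: mount point -> volume name.
mount_table = {
    "/users": "vol-users",
    "/projects": "vol-projects",
    "/projects/p1": "vol-p1",
}

nested = resolve(mount_table, "/projects/p1/d1")
simple = resolve(mount_table, "/users/alice")
root_of_volume = resolve(mount_table, "/users")
```

Paths under /projects/p1 go to the more specific volume, which is how a mount placed in the right spot keeps the namespace structure unchanged for users.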
– Scales both namespace and performance
– Gives performance isolation between volumes
– Well understood concepts
– Volume (NS + block pool) is a self-contained unit with no dependencies on other volumes
– Block pools and volumes are useful for other things
– Compatible with separation of block maps (true for most other solutions)
– Compatible with RO replicas (true for most other solutions)
– Manual separation of NSs (a few can be done automatically)
– Not completely transparent (namespace structure changes unless you manually mount in the right place)
– Managing multiple NNs
– http://hadoop.apache.org/core/
– http://wiki.apache.org/hadoop/
– http://wiki.apache.org/hadoop/GettingStartedWithHadoop
– http://wiki.apache.org/hadoop/HadoopMapReduce
– http://wiki.apache.org/hadoop/HdfsFutures
– http://research.yahoo.com/node/1980
– Sradia@yahoo-inc.com