 
HDFS: Hadoop Distributed File System
Outline: Motivation, File Management, Streaming Data, Fault Tolerance
Labs
Run 2–27 October (four weeks) at these times:
Monday 9am, Monday 10am, Tuesday 2pm, Wednesday 10am, Wednesday 2pm, Thursday 9am, Thursday 11am, Friday 11am, Friday 2pm
Lab groups will be chosen online: student.inf.ed.ac.uk.
Distributed Map-Reduce
[Diagram: Nodes 1 to 9 and Mappers 1 to 4.]
Large Data Sets
File sizes going up to petabytes.
How to get Data to Mappers?
[Diagram: Nodes 1 to 9 and Mappers 1 to 4.]
Bring Mappers to Data!
But disk access latency is so high!
Yes, but throughput is acceptable.
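The latency-versus-throughput trade-off can be made concrete with a little arithmetic. The sketch below is only illustrative: the 10 ms seek time and 100 MB/s transfer rate are assumed round figures, not measurements from the slides, and `seek_overhead` is a hypothetical helper name.

```python
def seek_overhead(read_bytes, seek_s=0.010, transfer_bytes_per_s=100e6):
    """Fraction of total read time spent seeking rather than transferring.

    seek_s and transfer_bytes_per_s are assumed, illustrative disk figures.
    """
    transfer_s = read_bytes / transfer_bytes_per_s
    return seek_s / (seek_s + transfer_s)


# One seek followed by a long sequential read versus one seek per small read.
large_read = seek_overhead(128 * 1024**2)   # 128 MB read in one go
small_read = seek_overhead(4 * 1024)        # 4 KB read per seek
print(f"128 MB read: {large_read:.2%} of the time is seek latency")
print(f"4 KB read:   {small_read:.2%} of the time is seek latency")
```

Under these assumptions the seek cost is almost invisible for long sequential reads and dominates for small random reads, which is why high latency can be tolerated as long as data is streamed.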
Distributed File System
HDFS is a GFS (Google File System) clone.
HDFS Design Choices
1. Support handling of large files across multiple nodes
2. Optimise for streaming access
3. Run on commodity hardware (which requires high fault tolerance)
Large Files
128 MB block size
Why such large blocks?
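To get a feel for what a 128 MB block size means for large files, the following sketch counts how many blocks a file occupies. The 4 KB comparison value is an assumed figure for a typical local file system block, and `num_blocks` is a hypothetical helper, not part of any Hadoop API.

```python
HDFS_BLOCK = 128 * 1024**2   # 128 MB, as on the slide
LOCAL_BLOCK = 4 * 1024       # assumed 4 KB block of a local file system


def num_blocks(file_bytes, block_size=HDFS_BLOCK):
    """Number of blocks needed to hold file_bytes (the last block may be partial)."""
    return -(-file_bytes // block_size)   # integer ceiling division


one_petabyte = 1024**5
print(num_blocks(one_petabyte))               # blocks at 128 MB each
print(num_blocks(one_petabyte, LOCAL_BLOCK))  # blocks at 4 KB each
```

Every block is something the file system has to track, so the block count is a rough proxy for bookkeeping cost.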
[Diagram: an HDFS datanode on top of a Linux file system.]
Demo
[Diagram: the HDFS namenode holds the file namespace (e.g. /foo/bar, block 3df2). It sends instructions to the datanodes and receives datanode state. Each HDFS datanode sits on a Linux file system.]
Optimised for Streaming
Successive reads, append writes: write once, read many.
[Diagram: read path.
The application calls the HDFS client.
Control flow: the HDFS client sends (file name, block id) to the HDFS namenode, whose file namespace holds /foo/bar with block 3df2; the namenode answers with (block id, block location). The namenode also sends instructions to the datanodes and receives datanode state.
Data flow: the HDFS client sends (block id, byte range) to an HDFS datanode, which sits on a Linux file system and returns the block data.]
Why such large blocks?
1. Less communication between master and workers
2. Reduced communication between client and datanodes
3. Less metadata to be stored in the namenode
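The read path in the diagram can be sketched as a toy model. This is not the Hadoop client API: the classes `NameNode` and `DataNode`, the function `read_file`, the second block id `a91c` and the node names `dn1` and `dn2` are all hypothetical, and real HDFS adds networking, checksums and failure handling.

```python
class NameNode:
    """Toy namenode: holds only metadata, never file contents."""

    def __init__(self):
        self.namespace = {}   # file name -> ordered list of block ids
        self.locations = {}   # block id -> list of datanode names

    def get_block_locations(self, file_name):
        # Control flow: answer with (block id, block locations) pairs.
        return [(b, self.locations[b]) for b in self.namespace[file_name]]


class DataNode:
    """Toy datanode: keeps block contents in a dict instead of a local file system."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes

    def read(self, block_id, start, end):
        # Data flow: return the requested byte range of one block.
        return self.blocks[block_id][start:end]


def read_file(namenode, datanodes, file_name):
    """Ask the namenode for metadata once, then fetch each block from a datanode."""
    chunks = []
    for block_id, holders in namenode.get_block_locations(file_name):
        datanode = datanodes[holders[0]]   # simply take the first listed location
        chunks.append(datanode.read(block_id, 0, None))
    return b"".join(chunks)


# Hypothetical two-block file /foo/bar; block ids and node names are made up.
nn = NameNode()
nn.namespace["/foo/bar"] = ["3df2", "a91c"]
nn.locations = {"3df2": ["dn1", "dn2"], "a91c": ["dn2"]}
dn1, dn2 = DataNode(), DataNode()
dn1.blocks["3df2"] = b"hello "
dn2.blocks["3df2"] = b"hello "
dn2.blocks["a91c"] = b"world"
result = read_file(nn, {"dn1": dn1, "dn2": dn2}, "/foo/bar")
print(result)
```

In this model, larger blocks mean fewer entries in `namespace` and `locations` and fewer trips around the loop in `read_file` for the same file, which mirrors the three reasons listed above.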
Which block location is best for the client?
[Diagram: application, HDFS client, block data.]
The closest one!
The network is represented as a tree. The distance between two nodes is the sum of their distances to their closest common ancestor.
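The distance rule can be written down directly if each node is identified by its path from the root of the tree. The path format `/datacentre/rack/node` and all names below are assumptions made for illustration; `distance` and `closest` are hypothetical helpers.

```python
def distance(a, b):
    """Hops from a up to the closest common ancestor plus hops from there down to b."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def closest(client, replicas):
    """Pick the replica location with the smallest tree distance to the client."""
    return min(replicas, key=lambda replica: distance(client, replica))


client = "/d1/r1/n1"
print(distance(client, "/d1/r1/n1"))  # same node
print(distance(client, "/d1/r1/n2"))  # same rack
print(distance(client, "/d1/r2/n3"))  # same data centre, different rack
print(distance(client, "/d2/r3/n4"))  # different data centre
print(closest(client, ["/d2/r3/n4", "/d1/r2/n3", "/d1/r1/n2"]))
```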
Fault Tolerance
Faults are the norm, not the exception.
Hadoop keeps three copies (replicas) of each block by default.
How to spread the copies across the cluster?
Demo
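The slides leave the answer to the demo. One strategy commonly described for HDFS's default placement is: first copy on the writer's node, second copy on a node in a different rack, third copy on another node in that same second rack. The sketch below illustrates that idea only; `place_replicas`, the rack layout and the node names are hypothetical, and it assumes the writer is itself a datanode and that some other rack has at least two nodes.

```python
import random


def place_replicas(writer, racks, rng=random):
    """Choose three nodes for one block.

    racks maps a rack name to the list of node names in that rack.
    Assumes writer is a node in racks and another rack has at least two nodes.
    """
    rack_of = {node: rack for rack, nodes in racks.items() for node in nodes}
    first = writer                                    # copy 1: local to the writer
    candidates = [r for r in racks
                  if r != rack_of[first] and len(racks[r]) >= 2]
    remote_rack = rng.choice(candidates)
    second = rng.choice(racks[remote_rack])           # copy 2: a different rack
    third = rng.choice([n for n in racks[remote_rack] if n != second])  # copy 3: same rack as copy 2
    return [first, second, third]


racks = {"r1": ["n1", "n2", "n3"], "r2": ["n4", "n5", "n6"], "r3": ["n7", "n8", "n9"]}
placement = place_replicas("n1", racks)
print(placement)
```

With this layout the block survives the loss of a single node or of a whole rack, while only one copy has to cross between racks when the block is written.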
Anatomy of a Write
Summary
1. HDFS handles large files across the cluster
2. HDFS is optimised for streaming access to files
3. HDFS runs on commodity hardware and needs to be fault tolerant