Hadoop Distributed File System (HDFS)
10/05/2018 1
HDFS Overview
A distributed file system
Built on the architecture of Google File System (GFS)
Shares a similar architecture to many other common distributed storage engines such as Amazon
[Figure: one name node and several data nodes; each data node holds file blocks (B)]
Name node: file and directory names; block ordering and locations; capacity of data nodes; architecture of data nodes
Data nodes: block data; name node location
[Figure: metadata for File1: a list of inodes (Block 1, Block 2, Block 3, …) and a list of block locations; the blocks (B) themselves are spread over the data nodes]
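The two metadata lists above (block ordering per file, and locations per block) can be pictured as two maps. The following is a minimal in-memory sketch, not Hadoop code: the class name, method names, and data node names such as "dn1" are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of name node metadata; not the Hadoop implementation.
final class NameNodeMetadataSketch {
    // File name -> ordered list of block IDs (block ordering).
    private final Map<String, List<Long>> fileBlocks = new HashMap<>();
    // Block ID -> data nodes that hold a copy of the block (block locations).
    private final Map<Long, List<String>> blockLocations = new HashMap<>();

    // Appends a block to a file and records which data nodes store it.
    void addBlock(String file, long blockId, List<String> dataNodes) {
        fileBlocks.computeIfAbsent(file, f -> new ArrayList<>()).add(blockId);
        blockLocations.put(blockId, new ArrayList<>(dataNodes));
    }

    // Returns the locations of every block of the file, in file order.
    List<List<String>> locate(String file) {
        List<List<String>> result = new ArrayList<>();
        for (long blockId : fileBlocks.getOrDefault(file, List.of())) {
            result.add(blockLocations.get(blockId));
        }
        return result;
    }

    public static void main(String[] args) {
        NameNodeMetadataSketch nameNode = new NameNodeMetadataSketch();
        nameNode.addBlock("File1", 1L, List.of("dn1", "dn2", "dn3"));
        nameNode.addBlock("File1", 2L, List.of("dn2", "dn4", "dn5"));
        System.out.println(nameNode.locate("File1"));
    }
}
```

Note that the block data itself never appears in this structure; only names, ordering, and locations do, which matches the split of responsibilities between the name node and the data nodes.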
[Figure: the file creator, the name node, and the data nodes]
Create(…): the creator process calls the create function, which translates to an RPC call at the name node.
Create(…): the master node (the name node) creates three initial blocks, placed on:
1. a machine
2. a random machine in the same rack as the first machine
3. a machine in another rack
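The placement order above can be sketched as a small selection routine. This is an illustrative sketch that follows the order as the slide states it, not the placement policy of any particular Hadoop release. The class, the rack map, and the node names are hypothetical, and the code assumes at least two racks and at least two machines in the chosen rack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the three-way placement described on the slide.
final class ReplicaPlacementSketch {
    // racks maps a rack name to the machines in that rack.
    // Assumes at least two racks and at least two machines in the first chosen rack.
    static List<String> place(Map<String, List<String>> racks, Random rnd) {
        List<String> rackNames = new ArrayList<>(racks.keySet());

        // 1. A random machine (random rack, then a random machine in it).
        String firstRack = rackNames.get(rnd.nextInt(rackNames.size()));
        List<String> sameRack = new ArrayList<>(racks.get(firstRack));
        String first = sameRack.remove(rnd.nextInt(sameRack.size()));

        // 2. A different random machine in the same rack as the first one.
        String second = sameRack.get(rnd.nextInt(sameRack.size()));

        // 3. A random machine in another rack.
        rackNames.remove(firstRack);
        String otherRack = rackNames.get(rnd.nextInt(rackNames.size()));
        List<String> otherMachines = racks.get(otherRack);
        String third = otherMachines.get(rnd.nextInt(otherMachines.size()));

        return List.of(first, second, third);
    }

    // Returns the rack that contains the given machine, or null if none does.
    static String rackOf(Map<String, List<String>> racks, String machine) {
        for (Map.Entry<String, List<String>> entry : racks.entrySet()) {
            if (entry.getValue().contains(machine)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = Map.of(
                "rackA", List.of("a1", "a2", "a3"),
                "rackB", List.of("b1", "b2"));
        System.out.println(place(racks, new Random()));
    }
}
```

Spreading the copies over two racks means a single rack failure cannot remove every copy, while keeping two copies in one rack keeps part of the write traffic inside that rack.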
[Figure: the file creator holds an OutputStream; data nodes 1, 2, and 3 hold the initial blocks]
[Figure: OutputStream#write; the file creator, the name node, and data nodes 1, 2, and 3]
OutputStream#write: when a block is filled up, the creator contacts the name node to create the next block.
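To make the "next block" step concrete, the sketch below counts how many blocks a writer ends up requesting as it writes. It is a simplified model, not Hadoop code: the class name is hypothetical, the block size is a constructor parameter chosen by the caller, and the only behavior modeled is that a new block is requested once the current one is full and more data arrives.

```java
// Hypothetical model of a writer that requests a new block whenever the current one fills up.
final class BlockWriterSketch {
    private final long blockSize;
    private long remainingInBlock;
    // The first block is assumed to have been allocated by create(...).
    private int blocksAllocated = 1;

    BlockWriterSketch(long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
        this.remainingInBlock = blockSize;
    }

    // Simulates writing numBytes bytes through the output stream.
    void write(long numBytes) {
        while (numBytes > 0) {
            if (remainingInBlock == 0) {
                // Block filled up: this is where the name node would be contacted.
                blocksAllocated++;
                remainingInBlock = blockSize;
            }
            long written = Math.min(numBytes, remainingInBlock);
            remainingInBlock -= written;
            numBytes -= written;
        }
    }

    int blocksAllocated() {
        return blocksAllocated;
    }

    public static void main(String[] args) {
        BlockWriterSketch writer = new BlockWriterSketch(100);
        writer.write(250);
        System.out.println(writer.blocksAllocated()); // prints 3
    }
}
```

The name node is contacted once per block rather than once per write call, so its load grows with the number of blocks, not with the number of bytes written.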
If the file creator is running on one of the data nodes, the first block is always assigned to that node.
The reader process calls the open function, which translates to an RPC call at the name node.
The name node locates the first block.
The name node returns an input stream (InputStream) for the file.
[Figure: InputStream#read(…); the file reader, the name node, and the data nodes]
When an end-of-block is reached, the name node locates the next block.
seek(pos): the InputStream#seek operation locates a block and positions the stream accordingly.
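Locating a block for a given position amounts to finding which block covers that byte offset and how far into the block it lies. The sketch below shows one way to do this from a list of block lengths; the class and method names are hypothetical and this is not how the Hadoop client is implemented.

```java
// Hypothetical sketch: map a file position to (block index, offset inside that block).
final class SeekSketch {
    // blockLengths holds the length in bytes of each block, in file order.
    // Returns a two-element array: {block index, offset within that block}.
    static long[] locate(long pos, long[] blockLengths) {
        if (pos < 0) {
            throw new IllegalArgumentException("negative position");
        }
        long blockStart = 0; // file offset at which the current block begins
        for (int i = 0; i < blockLengths.length; i++) {
            if (pos < blockStart + blockLengths[i]) {
                return new long[] {i, pos - blockStart};
            }
            blockStart += blockLengths[i];
        }
        throw new IllegalArgumentException("position is past the end of the file");
    }

    public static void main(String[] args) {
        long[] blockLengths = {100, 100, 50};
        long[] target = locate(249, blockLengths);
        System.out.println("block " + target[0] + ", offset " + target[1]); // prints "block 2, offset 49"
    }
}
```

A linear scan over per-block lengths is used instead of a single division so that the sketch also covers files whose blocks are not all the same size.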
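On open and seek, the copy of the block to read is chosen in this order of preference:
1. If a copy is stored on the reader's own machine, it is chosen to read
2. Otherwise, a copy on a machine in the same rack is chosen
3. Otherwise, another copy is chosen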
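A locality-based choice of which copy to read (the reader's own machine first, then the same rack, then anything else) can be sketched as below. The Replica type, the host and rack strings, and the tie-breaking rule of taking the first match are assumptions made for illustration; this is not Hadoop's implementation.

```java
import java.util.List;

// Hypothetical sketch of choosing which copy of a block to read, by locality.
final class ReplicaSelectionSketch {
    // Hypothetical descriptor of one copy of a block.
    static final class Replica {
        final String host;
        final String rack;

        Replica(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    // Assumes replicas is non-empty.
    static Replica choose(List<Replica> replicas, String readerHost, String readerRack) {
        // 1. Prefer a copy stored on the reader's own machine.
        for (Replica r : replicas) {
            if (r.host.equals(readerHost)) {
                return r;
            }
        }
        // 2. Otherwise prefer a copy on a machine in the reader's rack.
        for (Replica r : replicas) {
            if (r.rack.equals(readerRack)) {
                return r;
            }
        }
        // 3. Otherwise take any copy; the first one is used here to keep the sketch deterministic.
        return replicas.get(0);
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
                new Replica("h1", "r1"),
                new Replica("h2", "r2"),
                new Replica("h3", "r2"));
        System.out.println(choose(replicas, "h3", "r2").host); // prints h3
    }
}
```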
When self-reading occurs (the reader runs on the machine that stores the block), HDFS can make it much faster through a feature called short-circuit local reads.
Concatenate: File 1 + File 2 + File 3 into File 4
Rather than creating new blocks, HDFS can just change the metadata in the name node to delete File 1, File 2, and File 3, and assign their blocks to a new File 4 in the right order.
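The metadata-only nature of concatenation can be illustrated with a map from file names to block ID lists. This is a sketch of the idea as described above, with hypothetical class and method names; it does not reproduce the semantics of the real HDFS operation in detail.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: concatenation as a pure metadata change, with no block data copied.
final class ConcatSketch {
    // File name -> ordered list of block IDs.
    private final Map<String, List<Long>> fileBlocks = new HashMap<>();

    void putFile(String name, List<Long> blockIds) {
        fileBlocks.put(name, new ArrayList<>(blockIds));
    }

    // Removes the source files and assigns their blocks, in order, to the target file.
    void concat(String target, List<String> sources) {
        // Validate first so that a missing source leaves the metadata untouched.
        for (String source : sources) {
            if (!fileBlocks.containsKey(source)) {
                throw new IllegalArgumentException("no such file: " + source);
            }
        }
        List<Long> merged = new ArrayList<>();
        for (String source : sources) {
            merged.addAll(fileBlocks.remove(source));
        }
        fileBlocks.put(target, merged);
    }

    // Returns the block IDs of the file, or null if the file does not exist.
    List<Long> blocksOf(String name) {
        return fileBlocks.get(name);
    }

    public static void main(String[] args) {
        ConcatSketch nameNode = new ConcatSketch();
        nameNode.putFile("File1", List.of(1L, 2L));
        nameNode.putFile("File2", List.of(3L));
        nameNode.putFile("File3", List.of(4L, 5L));
        nameNode.concat("File4", List.of("File1", "File2", "File3"));
        System.out.println(nameNode.blocksOf("File4")); // prints [1, 2, 3, 4, 5]
    }
}
```

Only block IDs move between entries, so the cost depends on the number of blocks involved rather than on the amount of data they hold.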
FileSystem (implementations: DistributedFileSystem, LocalFileSystem, S3FileSystem)
Path
Configuration
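import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
Path path = new Path("…");
FileSystem fs = path.getFileSystem(conf);
// To get the local FS
fs = FileSystem.getLocal(conf);
// To get the default FS
fs = FileSystem.get(conf);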
FSDataOutputStream out = fs.create(path, …);
fs.delete(path, recursive);
fs.deleteOnExit(path);
fs.rename(oldPath, newPath);
FSDataInputStream in = fs.open(path, …);
in.seek(pos);
in.seekToNewSource(pos);
fs.concat(destination, sources); // sources is a Path[]
fs.getFileStatus(path);
fs.getFileBlockLocations(path, start, length); // byte range given as start offset and length