Hadoop Distributed File System (HDFS)

Oct 05, 2022

SLIDE 1

Hadoop Distributed File System (HDFS)

SLIDE 2

HDFS Overview

A distributed file system
Built on the architecture of Google File System (GFS)
Shares a similar architecture with many other common distributed storage engines, such as Amazon S3 and Microsoft Azure
HDFS is a stand-alone storage engine and can be used independently of the query processing engine

SLIDE 3

HDFS Architecture

[Diagram: a name node and several data nodes; each data node holds blocks (B)]

SLIDE 4

What is where?

[Diagram: a name node and several data nodes; each data node holds blocks (B)]

Name node:
File and directory names
Block ordering and locations
Capacity of data nodes
Architecture of data nodes

Data nodes:
Block data
Name node location
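The name-node side of this split can be sketched as two lookup tables. This is a minimal illustration, not Hadoop's actual data structures: the class `NameNodeMetadataSketch`, its methods, and the node names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of what the name node tracks; not Hadoop's real classes. */
final class NameNodeMetadataSketch {
    // File path -> ordered block IDs (file names and block ordering).
    private final Map<String, List<Long>> blocksByFile = new HashMap<>();
    // Block ID -> data nodes holding a replica (block locations).
    private final Map<Long, List<String>> nodesByBlock = new HashMap<>();

    /** Appends a block to a file and records which data nodes store it. */
    void addBlock(String file, long blockId, List<String> dataNodes) {
        blocksByFile.computeIfAbsent(file, k -> new ArrayList<>()).add(blockId);
        nodesByBlock.put(blockId, new ArrayList<>(dataNodes));
    }

    /** Returns, for each block of the file in order, the nodes that store it. */
    List<List<String>> locate(String file) {
        List<List<String>> result = new ArrayList<>();
        for (long blockId : blocksByFile.getOrDefault(file, new ArrayList<Long>())) {
            result.add(nodesByBlock.get(blockId));
        }
        return result;
    }
}
```

Note that the block data itself never appears in this structure; it lives only on the data nodes.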

SLIDE 5

Analogy to Unix FS

The logical view is similar

[Diagram: a directory tree rooted at / with the entries user, mary, chu, etc, and hadoop]

SLIDE 6

Analogy to Unix FS

The physical model is comparable

Unix vs. HDFS

[Diagram: in Unix, File1 points to a list of inodes that reference Block 1, Block 2, Block 3, …; in HDFS, File1 points to a list of block locations (metadata) that reference blocks (B) on the data nodes]

SLIDE 7

HDFS Create

[Diagram: file creator, name node, and data nodes]

SLIDE 8

HDFS Create

[Diagram: the file creator sends Create(…) to the name node]

The creator process calls the create function, which translates to an RPC call to the name node

SLIDE 9

HDFS Create

[Diagram: the file creator sends Create(…) to the name node; data nodes 1, 2, and 3 are marked]

The name node allocates three replicas for the first block

1. The first replica is assigned to a random machine
2. The second replica is assigned to another random machine in the same rack as the first machine
3. The third replica is assigned to a random machine in another rack
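The three placement steps above can be sketched as follows. This is an illustration of the policy as stated on this slide, not Hadoop's placement code, and the policy in a given Hadoop release may differ. The class name, the rack map, and the requirement that every rack has at least two nodes are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of the three-step replica placement described above. */
final class ReplicaPlacementSketch {
    /** Returns three node names: two on one rack, the third on a different rack. */
    static List<String> placeReplicas(Map<String, List<String>> nodesByRack, Random rng) {
        List<String> racks = new ArrayList<>(nodesByRack.keySet());
        if (racks.size() < 2) {
            throw new IllegalArgumentException("need at least two racks");
        }
        for (List<String> nodes : nodesByRack.values()) {
            if (nodes.size() < 2) {
                throw new IllegalArgumentException("sketch assumes two or more nodes per rack");
            }
        }
        // Step 1: a random machine (chosen here as random rack, then random node).
        String firstRack = racks.get(rng.nextInt(racks.size()));
        List<String> local = new ArrayList<>(nodesByRack.get(firstRack));
        String first = local.remove(rng.nextInt(local.size()));
        // Step 2: another random machine in the same rack as the first.
        String second = local.get(rng.nextInt(local.size()));
        // Step 3: a random machine in another rack.
        racks.remove(firstRack);
        List<String> remote = nodesByRack.get(racks.get(rng.nextInt(racks.size())));
        String third = remote.get(rng.nextInt(remote.size()));
        return Arrays.asList(first, second, third);
    }
}
```

Keeping two replicas in one rack makes the second copy cheap to write, while the third replica in another rack survives the loss of a whole rack.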

SLIDE 10

HDFS Create

[Diagram: file creator holding an OutputStream, name node, and data nodes 1, 2, and 3]

SLIDE 11

HDFS Create

[Diagram: file creator calling OutputStream#write, name node, and data nodes 1, 2, and 3]

SLIDE 12

HDFS Create

[Diagram: file creator calling OutputStream#write, name node, and data nodes 1, 2, and 3]

SLIDE 13

HDFS Create

[Diagram: file creator calling OutputStream#write, name node, and data nodes 1, 2, and 3]

SLIDE 14

HDFS Create

[Diagram: file creator calling OutputStream#write, name node, data nodes 1, 2, and 3, and the next block]

When a block is filled up, the creator contacts the name node to create the next block
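The block-by-block behaviour of a write can be sketched as a counter that requests a new block whenever the current one is full. This is a simulation only: `BlockWriterSketch` is a hypothetical class, no bytes are stored, and the block size used in the example is arbitrary.

```java
/** Hypothetical simulation of a writer that requests a new block when one fills up. */
final class BlockWriterSketch {
    private final long blockSize;
    private long remainingInBlock = 0; // no block has been allocated yet
    private int blocksAllocated = 0;

    BlockWriterSketch(long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    /** Simulates writing len bytes, allocating blocks as needed. */
    void write(long len) {
        while (len > 0) {
            if (remainingInBlock == 0) {
                allocateNextBlock(); // stands in for the call to the name node
            }
            long n = Math.min(len, remainingInBlock);
            remainingInBlock -= n;
            len -= n;
        }
    }

    private void allocateNextBlock() {
        blocksAllocated++;
        remainingInBlock = blockSize;
    }

    int blocksAllocated() {
        return blocksAllocated;
    }
}
```

For example, with a block size of 128, three writes of 100 bytes each fill two blocks completely and place the remaining 44 bytes in a third.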

SLIDE 15

Notes about writing to HDFS

Data transfers of replicas are pipelined
The data does not go through the name node
Random writing is not supported
Appending to a file is supported but it creates a new block
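The pipelining of replicas can be sketched as a chain of nodes that each store a packet and forward it to the next node. This shows only the forwarding path; a real pipeline overlaps the transfers in time. `PipelineNodeSketch` and the packet strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical data node that stores each packet and forwards it downstream. */
final class PipelineNodeSketch {
    final String name;
    final List<String> stored = new ArrayList<>();
    private final PipelineNodeSketch next; // null for the last replica in the chain

    PipelineNodeSketch(String name, PipelineNodeSketch next) {
        this.name = name;
        this.next = next;
    }

    /** Stores the packet locally, then passes it on; no name node is involved. */
    void receive(String packet) {
        stored.add(packet);
        if (next != null) {
            next.receive(packet);
        }
    }
}
```

The writer only ever talks to the first node in the chain, so its outgoing bandwidth is spent once per packet rather than once per replica.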

SLIDE 16

Self-writing

[Diagram: file creator, name node, and data nodes]

If the file creator is running on one of the data nodes, the first replica is always assigned to that node

SLIDE 17

Reading from HDFS

Reading is easier than writing
No replication is needed
Replication can be exploited
Random reading is allowed

SLIDE 18

HDFS Read

[Diagram: the file reader sends Open(…) to the name node]

The reader process calls the open function, which translates to an RPC call to the name node

SLIDE 19

HDFS Read

[Diagram: file reader holding an InputStream, name node, and data nodes]

The name node locates the first block of that file and returns the address of one of the nodes that store that block

The name node returns an input stream for the file

SLIDE 20

HDFS Read

[Diagram: file reader calling InputStream#read(…), name node, and data nodes]

SLIDE 21

HDFS Read

[Diagram: file reader, name node, data nodes, and the next block]

When an end-of-block is reached, the name node locates the next block

SLIDE 22

HDFS Read

[Diagram: the file reader calls seek(pos); name node and data nodes]

The InputStream#seek operation locates a block and positions the stream accordingly
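The arithmetic behind locating a block for a seek can be sketched as below, assuming every block except possibly the last has the same fixed size. `SeekSketch` and its method names are hypothetical; the real client also has to ask where that block is stored.

```java
/** Hypothetical helper mapping a file position to a block and an offset inside it. */
final class SeekSketch {
    /** Index (0-based) of the block that contains byte position pos. */
    static long blockIndex(long pos, long blockSize) {
        checkArgs(pos, blockSize);
        return pos / blockSize;
    }

    /** Offset of byte position pos from the start of its block. */
    static long offsetInBlock(long pos, long blockSize) {
        checkArgs(pos, blockSize);
        return pos % blockSize;
    }

    private static void checkArgs(long pos, long blockSize) {
        if (pos < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("pos must be >= 0 and blockSize > 0");
        }
    }
}
```

With a block size of 128, position 300 falls in block 2 at offset 44, so a seek only needs to contact a node that stores that one block.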

SLIDE 23

Self-reading

[Diagram: the file reader issues Open and seek; name node and data nodes]

1. If the block is locally stored on the reader, this replica is chosen to read
2. If not, a replica on another machine in the same rack is chosen
3. Otherwise, any other random replica is chosen

When self-reading occurs, HDFS can make it much faster through a feature called short-circuit local reads
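The replica preference order for reads can be sketched as a three-way choice. This is an illustration under assumptions, not Hadoop's selection code: the class `ReplicaChoiceSketch`, the map from replica node to rack, and the node and rack names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of choosing which replica a reader should use. */
final class ReplicaChoiceSketch {
    /** rackByReplicaNode maps each node that holds a replica to its rack. */
    static String choose(String readerNode, String readerRack,
                         Map<String, String> rackByReplicaNode, Random rng) {
        if (rackByReplicaNode.isEmpty()) {
            throw new IllegalArgumentException("block has no replicas");
        }
        // 1. Prefer a replica stored on the reader's own machine.
        if (rackByReplicaNode.containsKey(readerNode)) {
            return readerNode;
        }
        // 2. Otherwise prefer a replica in the reader's rack.
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, String> e : rackByReplicaNode.entrySet()) {
            if (e.getValue().equals(readerRack)) {
                candidates.add(e.getKey());
            }
        }
        // 3. Otherwise fall back to any replica.
        if (candidates.isEmpty()) {
            candidates.addAll(rackByReplicaNode.keySet());
        }
        return candidates.get(rng.nextInt(candidates.size()));
    }
}
```

Each step trades a little choice for less network distance: same machine first, then same rack, then anywhere.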

SLIDE 24

Notes About Reading

The API is much richer than the simple open/seek/close API
You can retrieve block locations
You can choose a specific replica to read
The same API is generalized to other file systems, including the local FS and S3

Review question: Compare random access read in local file systems to HDFS

SLIDE 25

HDFS Special Features

Node decommission
Load balancer
Cheap concatenation

SLIDE 26

Node Decommission

[Diagram: data nodes holding blocks (B)]

SLIDE 27

Load Balancing

[Diagram: data nodes holding blocks (B)]

SLIDE 28

Load Balancing

[Diagram: data nodes holding blocks (B)]

Start the load balancer

SLIDE 29

Cheap Concatenation

[Diagram: the name node concatenates File 1 + File 2 + File 3 ➔ File 4]

Rather than creating new blocks, HDFS can just change the metadata in the name node to delete File 1, File 2, and File 3, and assign their blocks to a new File 4 in the right order.
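The metadata-only nature of concatenation can be sketched with a map from file name to block list. This follows the description on this slide rather than the exact semantics of any Hadoop method; `ConcatSketch`, its methods, and the file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical name-node table showing that concatenation touches only metadata. */
final class ConcatSketch {
    private final Map<String, List<Long>> blocksByFile = new HashMap<>();

    void putFile(String path, List<Long> blockIds) {
        blocksByFile.put(path, new ArrayList<>(blockIds));
    }

    /** Returns the file's block IDs, or null if the file does not exist. */
    List<Long> blocksOf(String path) {
        return blocksByFile.get(path);
    }

    /** Assigns the sources' blocks, in order, to target and removes the sources. */
    void concat(String target, List<String> sources) {
        for (String src : sources) {
            if (!blocksByFile.containsKey(src)) {
                throw new IllegalArgumentException("unknown file: " + src);
            }
        }
        List<Long> merged = new ArrayList<>();
        for (String src : sources) {
            merged.addAll(blocksByFile.remove(src)); // no block data is copied
        }
        blocksByFile.put(target, merged);
    }
}
```

The cost depends on the number of blocks listed, not on how many bytes those blocks contain.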

SLIDE 30

HDFS API

FileSystem (implementations: DistributedFileSystem, LocalFileSystem, S3FileSystem)
Path
Configuration

SLIDE 31

HDFS API

Configuration conf = new Configuration();
Path path = new Path("…");
FileSystem fs = path.getFileSystem(conf);
// To get the local FS
fs = FileSystem.getLocal(conf);
// To get the default FS
fs = FileSystem.get(conf);

Create the file system

SLIDE 32

HDFS API

FSDataOutputStream out = fs.create(path, …);

Create a new file

fs.delete(path, recursive);
fs.deleteOnExit(path);

Delete a file

fs.rename(oldPath, newPath);

Rename a file

SLIDE 33

HDFS API

FSDataInputStream in = fs.open(path, …);

Open a file

in.seek(pos);
in.seekToNewSource(pos);

Seek to a different location

SLIDE 34

HDFS API

fs.concat(destination, sources);

Concatenate

fs.getFileStatus(path);

Get file metadata

fs.getFileBlockLocations(path, start, len);

Get block locations