Hadoop Distributed File System (HDFS)
PowerPoint Presentation, 10/05/2018


SLIDE 1

Hadoop Distributed File System (HDFS)

SLIDE 2

HDFS Overview

A distributed file system. Built on the architecture of the Google File System (GFS). Shares a similar architecture with many other common distributed storage engines, such as Amazon S3 and Microsoft Azure. HDFS is a stand-alone storage engine and can be used in isolation from the query processing engine.

SLIDE 3

HDFS Architecture

(Diagram: one name node and several data nodes; each data node stores a number of blocks, shown as B.)

SLIDE 4

What is where?

Name node: file and directory names, block ordering and locations, capacity of the data nodes, architecture of the data nodes.
Data nodes: block data, name node location.
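The split above can be sketched as two toy data holders (the class and field names here are illustrative, not Hadoop's internal ones): the name node knows names and locations but never the bytes, while a data node holds the bytes plus the address of its name node.

```java
import java.util.*;

// Toy sketch of the metadata split (illustrative names, not Hadoop's real classes).
class NameNodeMeta {
    // File and directory names, and the ordered block list per file
    Map<String, List<Integer>> fileToBlocks = new HashMap<>();
    // Block locations: which data nodes hold each block
    Map<Integer, List<String>> blockToNodes = new HashMap<>();
    // Capacity (and other descriptive info) of each data node
    Map<String, Long> nodeCapacity = new HashMap<>();
}

class DataNodeState {
    // The actual block data lives on the data node...
    Map<Integer, byte[]> blockData = new HashMap<>();
    // ...plus the address of the name node it reports to
    String nameNodeAddress;
}

public class WhatIsWhere {
    public static void main(String[] args) {
        NameNodeMeta nn = new NameNodeMeta();
        nn.fileToBlocks.put("/user/mary/file1", Arrays.asList(1, 2, 3));
        nn.blockToNodes.put(1, Arrays.asList("dn1", "dn2", "dn7"));

        DataNodeState dn1 = new DataNodeState();
        dn1.nameNodeAddress = "nn:8020";
        dn1.blockData.put(1, new byte[]{42});

        // The name node knows block ordering and locations, never the bytes:
        System.out.println(nn.fileToBlocks.get("/user/mary/file1")); // [1, 2, 3]
        System.out.println(dn1.blockData.get(1).length);             // 1
    }
}
```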

SLIDE 5

Analogy to Unix FS

The logical view is similar

(Directory tree: / contains user and etc; user contains mary and chu; etc contains hadoop.)

SLIDE 6

Analogy to Unix FS

The physical model is comparable

Unix vs. HDFS:

Unix: File1 maps to an inode holding a list of block pointers (Block 1, Block 2, Block 3, …).
HDFS: File1 maps to metadata at the name node holding a list of block locations; the blocks themselves live on the data nodes.

SLIDE 7

HDFS Create

(Diagram: file creator, name node, and data nodes.)

SLIDE 8

HDFS Create

The creator process calls the create(…) function, which translates to an RPC call at the name node.

SLIDE 9

HDFS Create

The master (name) node creates three initial blocks:
1. The first block is assigned to a random machine.
2. The second block is assigned to another random machine in the same rack as the first machine.
3. The third block is assigned to a random machine in another rack.
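The three-step rule above can be sketched as a toy simulation (node and rack names are made up, and the real logic, Hadoop's default block placement policy, is considerably more involved):

```java
import java.util.*;

// Toy sketch of the slide's three-step rack-aware placement rule (illustrative only).
public class PlacementSketch {
    // node name -> rack name
    static Map<String, String> racks = new LinkedHashMap<>();

    static List<String> placeReplicas(Random rnd) {
        List<String> nodes = new ArrayList<>(racks.keySet());
        // 1. First assignment: a random node
        String first = nodes.get(rnd.nextInt(nodes.size()));
        // 2. Second: another node in the same rack as the first
        String second = pick(nodes, rnd,
                n -> !n.equals(first) && racks.get(n).equals(racks.get(first)));
        // 3. Third: a node in a different rack
        String third = pick(nodes, rnd,
                n -> !racks.get(n).equals(racks.get(first)));
        return Arrays.asList(first, second, third);
    }

    static String pick(List<String> nodes, Random rnd,
                       java.util.function.Predicate<String> ok) {
        List<String> candidates = new ArrayList<>();
        for (String n : nodes) if (ok.test(n)) candidates.add(n);
        return candidates.get(rnd.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        racks.put("dn1", "rack1"); racks.put("dn2", "rack1");
        racks.put("dn3", "rack2"); racks.put("dn4", "rack2");
        List<String> r = placeReplicas(new Random(0));
        // The first two share a rack; the third is on a different rack.
        System.out.println(racks.get(r.get(0)).equals(racks.get(r.get(1)))); // true
        System.out.println(racks.get(r.get(0)).equals(racks.get(r.get(2)))); // false
    }
}
```

Whatever the random choices, the invariant printed at the end always holds: two copies end up in one rack for write locality, and one copy lands in another rack to survive a rack failure.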

SLIDE 10

HDFS Create

The create call returns an OutputStream to the file creator. (Diagram: blocks 1, 2, 3 allocated on the data nodes.)

SLIDE 11

HDFS Create

(Diagram: the creator writes data via OutputStream#write; the bytes go to the data nodes holding blocks 1, 2, 3, not to the name node.)

SLIDE 12

HDFS Create

(Diagram: OutputStream#write continues; the written data is pipelined between the replicas.)

SLIDE 13

HDFS Create

(Diagram: OutputStream#write continues filling the block.)

SLIDE 14

HDFS Create

When a block is filled up, the creator contacts the name node to create the next block. (Diagram: OutputStream#write continues into the new block.)

SLIDE 15

Notes about writing to HDFS

Data transfers of replicas are pipelined.
The data does not go through the name node.
Random writing is not supported.
Appending to a file is supported, but it creates a new block.
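The last note can be illustrated with a toy model in which a file is a list of fixed-size blocks and an append always starts a fresh block, as the slide describes (the block size is shrunk to 4 bytes for readability; real HDFS blocks are 64 to 128 MB):

```java
import java.util.*;

// Toy model of the slide's claim that an append starts a new block (illustrative only).
public class AppendSketch {
    static final int BLOCK_SIZE = 4;               // tiny block size for illustration
    static List<StringBuilder> blocks = new ArrayList<>();

    // Sequential write: fill the current block, then open the next one
    static void write(String data) {
        for (char c : data.toCharArray()) {
            if (blocks.isEmpty() || blocks.get(blocks.size() - 1).length() == BLOCK_SIZE)
                blocks.add(new StringBuilder());   // "contact the name node" for a new block
            blocks.get(blocks.size() - 1).append(c);
        }
    }

    // Append: always starts a fresh block, even if the last one still has room
    static void append(String data) {
        blocks.add(new StringBuilder());
        write(data);
    }

    public static void main(String[] args) {
        write("abcde");                 // -> blocks "abcd", "e"
        append("fg");                   // -> new block "fg"; "e" is left untouched
        System.out.println(blocks.size()); // 3
        System.out.println(blocks.get(1)); // e
    }
}
```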

SLIDE 16

Self-writing

If the file creator is running on one of the data nodes, the first replica is always assigned to that node.

SLIDE 17

Reading from HDFS

Reading is relatively easier.
No replication is needed.
Replication can be exploited.
Random reading is allowed.

SLIDE 18

HDFS Read

The reader process calls the open(…) function, which translates to an RPC call at the name node.

SLIDE 19

HDFS Read

The name node locates the first block of the file and returns the address of one of the nodes that store that block. The open call then returns an InputStream for the file to the reader.

SLIDE 20

HDFS Read

(Diagram: the reader calls InputStream#read(…) and reads the block data directly from a data node.)

SLIDE 21

HDFS Read

When an end-of-block is reached, the name node locates the next block.

SLIDE 22

HDFS Read

The InputStream#seek(pos) operation locates the block that contains the requested position and positions the stream accordingly.
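With a fixed block size, mapping a seek position to a block is simple arithmetic. A sketch, assuming the common 128 MB block size (the actual size is configurable per file):

```java
// Sketch: mapping a byte offset to (block index, offset within block).
public class SeekSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;     // assume 128 MB blocks

    public static void main(String[] args) {
        long pos = 300L * 1024 * 1024;                     // seek to byte 300 MB
        long blockIndex = pos / BLOCK_SIZE;                // which block holds this byte
        long offsetInBlock = pos % BLOCK_SIZE;             // where to start inside it
        System.out.println(blockIndex);                    // 2
        System.out.println(offsetInBlock / (1024 * 1024)); // 44 (MB into block 2)
    }
}
```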

SLIDE 23

Self-reading

On an open or seek, the replica to read is chosen as follows:
1. If the block is locally stored on the reader's machine, this replica is chosen to read.
2. If not, a replica on another machine in the same rack is chosen.
3. Otherwise, any other random replica is chosen.

When self-reading occurs, HDFS can make it much faster through a feature called short-circuit local reads.
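The three-step replica choice can be sketched as a toy function (node and rack names are made up; this is not the actual HDFS client logic):

```java
import java.util.*;

// Toy sketch of the three-step replica choice for a read (illustrative only).
public class ReadChoice {
    static String chooseReplica(String reader, String readerRack,
                                Map<String, String> replicaRacks) { // node -> rack
        // 1. Prefer a replica on the reader's own machine
        if (replicaRacks.containsKey(reader)) return reader;
        // 2. Else prefer a replica in the reader's rack
        for (Map.Entry<String, String> e : replicaRacks.entrySet())
            if (e.getValue().equals(readerRack)) return e.getKey();
        // 3. Else fall back to any replica
        return replicaRacks.keySet().iterator().next();
    }

    public static void main(String[] args) {
        Map<String, String> replicas = new LinkedHashMap<>();
        replicas.put("dn1", "rack1");
        replicas.put("dn3", "rack2");

        System.out.println(chooseReplica("dn1", "rack1", replicas)); // dn1 (local)
        System.out.println(chooseReplica("dn2", "rack1", replicas)); // dn1 (same rack)

        Map<String, String> far = new LinkedHashMap<>();
        far.put("dn3", "rack2");
        System.out.println(chooseReplica("dn9", "rack9", far));      // dn3 (any replica)
    }
}
```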

SLIDE 24

Notes About Reading

The API is much richer than the simple open/seek/close API:
You can retrieve block locations.
You can choose a specific replica to read.

The same API is generalized to other file systems, including the local FS and S3.

Review question: Compare random-access reads in local file systems to HDFS.

SLIDE 25

HDFS Special Features

Node decommission. Load balancing. Cheap concatenation.

SLIDE 26

Node Decommission


(Diagram: a data node being decommissioned; its blocks are re-replicated to the remaining data nodes.)

SLIDE 27

Load Balancing


(Diagram: blocks unevenly distributed across the data nodes.)

SLIDE 28

Load Balancing


Start the load balancer. (Diagram: blocks migrate from heavily loaded data nodes to lightly loaded ones.)

SLIDE 29

Cheap Concatenation


Concatenate File 1 + File 2 + File 3 → File 4. Rather than creating new blocks, HDFS can just change the metadata in the name node to delete File 1, File 2, and File 3, and assign their blocks to a new File 4 in the right order.
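The metadata-only trick can be sketched with a toy name-node table mapping file names to block lists (illustrative only; the real fs.concat imposes additional restrictions on the files involved):

```java
import java.util.*;

// Toy sketch of metadata-only concatenation: the name node splices the
// block lists together in order; no block data is copied (illustrative only).
public class ConcatSketch {
    static Map<String, List<Integer>> files = new LinkedHashMap<>(); // file -> block ids

    static void concat(String dst, String... srcs) {
        List<Integer> merged = new ArrayList<>();
        for (String s : srcs) {
            merged.addAll(files.get(s));  // take the source's blocks in order
            files.remove(s);              // the source file name disappears
        }
        files.put(dst, merged);           // dst now owns all the blocks
    }

    public static void main(String[] args) {
        files.put("File1", Arrays.asList(1, 2));
        files.put("File2", Arrays.asList(3));
        files.put("File3", Arrays.asList(4, 5));
        concat("File4", "File1", "File2", "File3");
        System.out.println(files.keySet());     // [File4]
        System.out.println(files.get("File4")); // [1, 2, 3, 4, 5]
    }
}
```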

SLIDE 30

HDFS API


Key classes: FileSystem (the abstract base class), with subclasses DistributedFileSystem, LocalFileSystem, and S3FileSystem; plus Path and Configuration.

SLIDE 31

HDFS API


Create the file system:

Configuration conf = new Configuration();
Path path = new Path("…");
FileSystem fs = path.getFileSystem(conf);
// To get the local FS
fs = FileSystem.getLocal(conf);
// To get the default FS
fs = FileSystem.get(conf);

SLIDE 32

HDFS API


Create a new file:

FSDataOutputStream out = fs.create(path, …);

Delete a file:

fs.delete(path, recursive);
fs.deleteOnExit(path);

Rename a file:

fs.rename(oldPath, newPath);

SLIDE 33

HDFS API


Open a file:

FSDataInputStream in = fs.open(path, …);

Seek to a different location:

in.seek(pos);
in.seekToNewSource(pos);

SLIDE 34

HDFS API


Concatenate files:

fs.concat(destination, src[]);

Get file metadata:

fs.getFileStatus(path);

Get block locations:

fs.getFileBlockLocations(path, from, to);