HDFS: Hadoop Distributed File System (PowerPoint presentation)

SLIDE 1

HDFS

Hadoop Distributed File System

Outline: Motivation, File Management, Streaming Data, Fault Tolerance

SLIDE 2

Labs

Run 2–27 October (four weeks) at these times:

Monday 9am
Monday 10am
Tuesday 2pm
Wednesday 10am
Wednesday 2pm
Thursday 9am
Thursday 11am
Friday 11am
Friday 2pm

Lab groups will be chosen online: student.inf.ed.ac.uk.

SLIDE 3

Distributed Map-Reduce

[Diagram: a cluster of nine nodes (Node 1 to Node 9) with four mappers (Mapper 1 to Mapper 4) running on it.]

SLIDE 4

Large Data Sets

file sizes going up to petabytes

SLIDE 5

How to get Data to Mappers?

[Diagram: the nine-node cluster (Node 1 to Node 9) with Mappers 1 to 4; a "?" marks the open question of how the input data reaches the mappers.]

SLIDE 6

How to get Data to Mappers?

[Diagram: the nine-node cluster (Node 1 to Node 9) with Mappers 1 to 4, now marked "!".]

SLIDE 7

Bring Mappers to Data!

[Diagram: the nine-node cluster (Node 1 to Node 9) with Mappers 1 to 4.]
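
Bringing mappers to the data can be illustrated with a small scheduling sketch. It assumes a hypothetical map from blocks to the nodes that store them and a set of nodes with a free mapper slot; the function and variable names are made up for illustration and are not Hadoop's scheduler.

```python
def assign_mappers(block_locations, free_nodes):
    """Assign each block's mapper to a node that already stores the block.

    block_locations: dict block -> list of nodes holding a copy (hypothetical layout)
    free_nodes: set of nodes that can each still run one mapper
    Returns dict block -> (node, is_local).
    """
    free = set(free_nodes)
    assignment = {}
    for block, holders in block_locations.items():
        if not free:
            raise RuntimeError("no free nodes left")
        local = [n for n in holders if n in free]
        if local:
            node, is_local = local[0], True
        else:
            # No free node stores the block: fall back to any free node,
            # which means the data has to travel over the network.
            node, is_local = sorted(free)[0], False
        free.discard(node)
        assignment[block] = (node, is_local)
    return assignment


blocks = {"b1": ["node1", "node4"], "b2": ["node1", "node2"], "b3": ["node1"]}
assignment = assign_mappers(blocks, {"node1", "node2", "node3"})
for block, (node, is_local) in assignment.items():
    print(block, "->", node, "(local)" if is_local else "(remote read)")
```

In this toy run the mappers for b1 and b2 land on nodes that hold their block, while b3 has to read remotely because its only holder is already busy.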

SLIDE 9

But disk access latency is so high!

Yes, but throughput is acceptable.
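
A rough calculation shows why latency matters less for large sequential reads. The seek time and transfer rate below are assumed round figures chosen for illustration; they do not come from the slides.

```python
SEEK_TIME_S = 0.010        # assumed average seek latency: 10 ms
TRANSFER_MB_PER_S = 100.0  # assumed sequential transfer rate: 100 MB/s


def seek_overhead(read_size_mb):
    """Fraction of the total read time spent seeking, for one contiguous read."""
    transfer_time = read_size_mb / TRANSFER_MB_PER_S
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time)


for size_mb in (0.004, 1, 128):
    print(f"{size_mb} MB read: {seek_overhead(size_mb):.1%} of the time is seek latency")
```

Under these assumptions a 4 KB read is almost entirely seek time, while a 128 MB read spends less than 1% of its time seeking.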

SLIDE 11

Distributed File System

HDFS is a GFS (Google File System) clone

SLIDE 14

HDFS Design Choices

1. Support handling of large files across multiple nodes
2. Optimise for streaming access
3. Run on commodity hardware (which requires high fault tolerance)

SLIDE 15

Large Files

128 MB Block Size
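
As a minimal sketch of what a 128 MB block size means for one file, the helper below cuts a file of a given size into (offset, length) pairs. The function name is hypothetical, and the sketch assumes the last block only covers the bytes that remain.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as on the slide


def split_into_blocks(file_size):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(BLOCK_SIZE, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
for offset, length in split_into_blocks(300 * MB):
    print(f"block at offset {offset // MB} MB, length {length // MB} MB")
```

A 300 MB file therefore becomes three blocks: two full ones and a final 44 MB one.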

SLIDE 16

Why such large blocks?

SLIDE 18

[Diagram: an HDFS datanode on top of the Linux file system.]

Demo

SLIDE 19

[Diagram: an HDFS namenode holding the file namespace (/foo/bar, block 3df2) and two HDFS datanodes, each on a Linux file system. The namenode sends instructions to the datanodes and receives datanode state.]
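
The namenode's bookkeeping can be sketched as two lookup tables: one from file paths to block ids and one from block ids to the datanodes that report holding them. This is a toy model under that assumption; the class and method names are invented for illustration and are not Hadoop's classes.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model of the metadata a namenode keeps (not Hadoop's actual code)."""

    def __init__(self):
        self.namespace = {}                       # file path -> ordered list of block ids
        self.block_locations = defaultdict(set)   # block id -> datanodes reporting it

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Datanode state: a datanode reports which blocks it currently stores."""
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return [(block id, sorted datanode names)] for every block of the file."""
        return [(b, sorted(self.block_locations[b])) for b in self.namespace[path]]


nn = ToyNameNode()
nn.add_file("/foo/bar", ["3df2"])
nn.block_report("datanode1", ["3df2"])
nn.block_report("datanode2", ["3df2"])
print(nn.locate("/foo/bar"))
```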

SLIDE 20

Optimised for Streaming

Append-only writes
Successive (sequential) reads

Write once, read many

SLIDE 21

[Diagram: an HDFS namenode (file namespace /foo/bar, block 3df2) exchanges instructions and datanode state with two HDFS datanodes on Linux file systems. An application talks to them through the HDFS client. Control flow: the client sends (file name, block id) to the namenode and receives (block id, block location). Data flow: the client sends (block id, byte range) to a datanode and receives block data.]
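
The two flows in the diagram can be sketched as a toy read path: a metadata lookup first, then a direct read from a datanode. All names here are hypothetical, the namenode is reduced to a plain dictionary, and the first element of the request is interpreted as the block's position in the file, which is an assumption rather than something the slide states.

```python
class ToyDataNode:
    def __init__(self):
        self.blocks = {}  # block id -> bytes stored on this datanode

    def read(self, block_id, start, end):
        """Data flow: return the requested byte range of a block."""
        return self.blocks[block_id][start:end]


class ToyClient:
    def __init__(self, block_map, datanodes):
        # block_map stands in for the namenode:
        # (file name, block index) -> (block id, [datanode names])
        self.block_map = block_map
        self.datanodes = datanodes  # datanode name -> ToyDataNode

    def read(self, file_name, block_index, start, end):
        # Control flow: ask the "namenode" where the block lives.
        block_id, locations = self.block_map[(file_name, block_index)]
        # Data flow: fetch the bytes directly from one datanode.
        return self.datanodes[locations[0]].read(block_id, start, end)


dn = ToyDataNode()
dn.blocks["3df2"] = b"hello world"
client = ToyClient({("/foo/bar", 0): ("3df2", ["dn1"])}, {"dn1": dn})
print(client.read("/foo/bar", 0, 0, 5))
```

The point of the split is that the block data never passes through the namenode; only the small metadata lookup does.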

SLIDE 25

[Diagram: an HDFS namenode (file namespace /foo/bar, block 3df2) exchanges instructions and datanode state with two HDFS datanodes on Linux file systems. An application talks to them through the HDFS client. Control flow: the client sends (file name, block id) to the namenode and receives (block id, block location). Data flow: the client sends (block id, byte range) to a datanode and receives block data.]

Why such large blocks?

1. Less communication between master and workers
2. Reduced communication between client and datanodes
3. Less metadata to be saved in the namenode
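
The metadata argument can be made concrete with a back-of-the-envelope estimate. The figure of roughly 150 bytes of namenode memory per block is an assumed rule of thumb, not a number from the slides, and per-file overhead is ignored.

```python
BYTES_PER_BLOCK_ENTRY = 150  # assumed rule-of-thumb figure, not taken from the slides


def namenode_metadata_bytes(total_data_bytes, block_size_bytes):
    """Estimate namenode memory for block entries, ignoring per-file overhead."""
    num_blocks = -(-total_data_bytes // block_size_bytes)  # ceiling division
    return num_blocks * BYTES_PER_BLOCK_ENTRY


PB = 1024 ** 5
for label, block_size in (("4 KB", 4 * 1024), ("128 MB", 128 * 1024 ** 2)):
    size = namenode_metadata_bytes(PB, block_size)
    print(f"{label} blocks: about {size / 1024 ** 3:.2f} GiB of block metadata for 1 PB")
```

With these assumptions, tracking a petabyte in 128 MB blocks needs on the order of a gigabyte of block metadata, whereas 4 KB blocks would need tens of thousands of times more.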

SLIDE 27

Which block location is best for the client?

[Diagram: the application's HDFS client about to fetch block data, with "?" marking the choice of location.]

The closest one!

SLIDE 28

The network is represented as a tree. The distance between two nodes is the sum of their distances to their closest common ancestor.
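
The distance rule can be sketched directly if each node is written as a path from the root of the tree. The path layout used here (data centre, rack, node) and the function names are assumptions made for illustration.

```python
def tree_distance(a, b):
    """Hops from a up to the closest common ancestor plus hops from b up to it."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def closest_replica(client, replicas):
    """Pick the replica location with the smallest tree distance to the client."""
    return min(replicas, key=lambda r: tree_distance(client, r))


print(tree_distance("/d1/r1/n1", "/d1/r1/n1"))  # same node
print(tree_distance("/d1/r1/n1", "/d1/r1/n2"))  # same rack
print(tree_distance("/d1/r1/n1", "/d1/r2/n3"))  # same data centre, other rack
print(tree_distance("/d1/r1/n1", "/d2/r3/n4"))  # other data centre
```

With three levels below the root, the four cases above give distances 0, 2, 4 and 6.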

SLIDE 29

Fault Tolerance

Faults are the norm, not the exception.

SLIDE 30

Hadoop keeps three copies (replicas) of each block by default.

SLIDE 32

How to spread the copies across the cluster?

Demo
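
The slides leave the answer to the demo. One commonly described default for three copies is: the first on the writer's own node, the second on a node in a different rack, and the third on another node in that same remote rack. The sketch below follows that description; the topology layout and names are hypothetical, and it assumes at least two racks and at least two nodes in the remote rack.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three datanodes for one block.

    topology: dict rack name -> list of node names (hypothetical layout)
    writer: name of the node writing the block, assumed to be a datanode
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    first = writer
    # Second copy: a node on a different rack than the first.
    remote_racks = [r for r in topology if r != rack_of[first]]
    second_rack = rng.choice(remote_racks)
    second = rng.choice(topology[second_rack])
    # Third copy: a different node on the same rack as the second.
    third = rng.choice([n for n in topology[second_rack] if n != second])
    return [first, second, third]


topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
replicas = place_replicas("n1", topology, random.Random(0))
print(replicas)
```

This layout survives the loss of a whole rack while keeping two of the three transfers inside one rack.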

SLIDE 33

Anatomy of a Write
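
A write can be pictured as a pipeline: the client sends each packet to the first datanode, which stores it and forwards it to the next, and the acknowledgement travels back once the last datanode has the data. The sketch below models only that chaining; the function name and the in-memory "stores" dictionary are hypothetical, and failure handling is left out.

```python
def pipeline_write(data, pipeline, stores):
    """Write data through a chain of datanodes; return True once all have stored it.

    pipeline: ordered list of datanode names chosen for the block
    stores: dict datanode name -> list of received packets (toy storage)
    """
    if not pipeline:
        return True  # end of the chain: nothing left to forward to
    head, rest = pipeline[0], pipeline[1:]
    stores.setdefault(head, []).append(data)    # the head stores the packet ...
    return pipeline_write(data, rest, stores)   # ... forwards it, and passes the ack back up


stores = {}
acked = pipeline_write(b"packet-1", ["dn1", "dn2", "dn3"], stores)
print(acked, sorted(stores))
```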

SLIDE 34

Summary

1. HDFS handles large files across the cluster
2. HDFS is optimised for streaming access to files
3. HDFS runs on commodity hardware and needs to be fault tolerant
