SLIDE 1

Big Data Processing Technologies

Chentao Wu Associate Professor

  • Dept. of Computer Science and Engineering

wuct@cs.sjtu.edu.cn

SLIDE 2

Schedule

  • lec1: Introduction to big data and cloud computing
  • lec2: Introduction to data storage
  • lec3: Data reliability (Replication/Archive/EC)
  • lec4: Data consistency problem
  • lec5: Block level storage and file storage
  • lec6: Object-based storage
  • lec7: Distributed file system
  • lec8: Metadata management
SLIDE 3

Collaborators

SLIDE 4

Data Reliability Problem (1) Google – Disk Annual Failure Rate

SLIDE 5

Data Reliability Problem (2) Facebook – Failed nodes in a 3,000-node cluster

SLIDE 6

Contents

1. Introduction to Replication

SLIDE 7

What is Replication?

  • Replication is the process of creating an exact copy (replica) of data
  • Replication can be classified as:
  • Local replication: replicating data within the same array or data center
  • Remote replication: replicating data at a remote site

[Figure: Source volume → replication → Replica (Target)]

SLIDE 8

File System Consistency: Flushing Host Buffer

[Figure: host I/O stack (Application, File System, Memory Buffers, Logical Volume Manager, Physical Disk Driver); host buffers are flushed to disk so that the source and the replica contain consistent data]

SLIDE 9

Database Consistency: Dependent Write I/O Principle

[Figure: dependent writes 1–4 on source and replica; a replica that is missing one of the dependent writes is inconsistent, while a replica holding all of them is consistent]

SLIDE 10

Host-based Replication: LVM-based Mirroring

[Figure: a host logical volume mirrored across Physical Volume 1 and Physical Volume 2]

  • LVM: Logical Volume Manager
SLIDE 11

Host-based Replication: File System Snapshot

  • Pointer-based replication
  • Uses the Copy on First Write (CoFW) principle
  • Uses a bitmap and a block map
  • Requires only a fraction of the space used by the production FS

[Figure: production FS and FS snapshot metadata; the bitmap marks blocks that have been copied on first write, and the block map points snapshot reads to either the saved original blocks or the production FS]
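To make the CoFW mechanics concrete, here is a minimal sketch in Python; it is illustrative only (the class and method names `Snapshot`, `write_block`, and `read_snapshot` are not from any real product), but it shows how the bitmap and block map interact:

```python
class Snapshot:
    """Minimal copy-on-first-write (CoFW) snapshot sketch.

    bitmap[i] = 1 once production block i has been overwritten after the snapshot.
    block_map holds the saved original contents for exactly those blocks.
    """
    def __init__(self, production):
        self.production = production          # list of blocks (the live FS)
        self.bitmap = [0] * len(production)   # which blocks were copied on first write
        self.block_map = {}                   # block index -> original (point-in-time) data

    def write_block(self, i, new_data):
        if not self.bitmap[i]:                       # first write after snapshot activation
            self.block_map[i] = self.production[i]   # save the original data
            self.bitmap[i] = 1
        self.production[i] = new_data                # then update the production FS

    def read_snapshot(self, i):
        # Snapshot read: saved original if the block changed, else the live block.
        return self.block_map[i] if self.bitmap[i] else self.production[i]


fs = ["a", "b", "c", "d"]
snap = Snapshot(fs)
snap.write_block(1, "B'")
print(snap.read_snapshot(1))   # "b"  – the point-in-time view is preserved
print(fs[1])                   # "B'" – the production FS sees the new data
```

Because unchanged blocks are never copied, the snapshot really does consume only a fraction of the production FS space, as the bullet above states.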

SLIDE 12

Storage Array-based Local Replication

  • Replication performed by the array operating environment
  • Source and replica are on the same array
  • Types of array-based replication:
  • Full-volume mirroring
  • Pointer-based full-volume replication
  • Pointer-based virtual replication

[Figure: production host and BC host attached to the same storage array, which holds both the source and the replica]

SLIDE 13

Full-Volume Mirroring

  • Attached: the source is read/write for the production host; the target is not ready for the BC host
  • Detached (point-in-time copy): both the source and the target are read/write

SLIDE 14

Copy on First Access: Write to the Source

  • When a write is issued to the source for the first time after replication session activation:
  • The original data at that address is copied to the target
  • Then the new data is updated on the source
  • This ensures that the original data at the point in time of activation is preserved on the target

[Figure: production host writes C' to the source; the original C is first copied to the target]

SLIDE 15

Copy on First Access: Write to the Target

  • When a write is issued to the target for the first time after replication session activation:
  • The original data is copied from the source to the target
  • Then the new data is updated on the target

[Figure: BC host writes B' to the target; the original B is first copied over from the source]

SLIDE 16

Copy on First Access: Read from Target

  • When a read is issued to the target for the first time after replication session activation:
  • The original data is copied from the source to the target and is made available to the BC host

[Figure: BC host reads data "A" from the target; A is first copied from the source]
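A rough sketch of the three copy-on-first-access cases on slides 14–16; the class and method names (`CoFASession`, `write_source`, and so on) are illustrative only, not any particular array's implementation:

```python
class CoFASession:
    """Copy on First Access: data moves from source to target only when first touched."""
    def __init__(self, source):
        self.source = source                 # blocks on the production volume
        self.target = [None] * len(source)   # replica starts empty
        self.copied = [False] * len(source)  # has block i been copied to the target yet?

    def _copy_if_needed(self, i):
        if not self.copied[i]:
            self.target[i] = self.source[i]  # preserve the point-in-time data
            self.copied[i] = True

    def write_source(self, i, data):
        self._copy_if_needed(i)   # original data goes to the target first
        self.source[i] = data     # then the source is updated

    def write_target(self, i, data):
        self._copy_if_needed(i)   # original data is copied from the source first
        self.target[i] = data     # then the target is updated

    def read_target(self, i):
        self._copy_if_needed(i)   # fault the original data over on first read
        return self.target[i]


s = CoFASession(["A", "B", "C"])
s.write_source(2, "C'")
print(s.read_target(2))   # "C" – the point-in-time copy, not C'
```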

SLIDE 17

Tracking Changes to Source and Target

  • Bitmaps on the source and the target record which blocks change after the point in time (PIT): at the PIT all bits are 0, and a bit flips to 1 when its block is written
  • A logical OR of the two bitmaps identifies the blocks that must be copied for resynchronization/restore
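A tiny illustration of the bitmap OR used for resynchronization; the variable names are only for illustration:

```python
# 0 = unchanged since the point in time, 1 = changed
source_bitmap = [0, 1, 0, 0, 1, 0, 0, 0]   # blocks written on the source after the PIT
target_bitmap = [0, 0, 0, 1, 0, 0, 0, 0]   # blocks written on the target after the PIT

# Logical OR: every block that now differs between source and target
resync_bitmap = [s | t for s, t in zip(source_bitmap, target_bitmap)]
blocks_to_copy = [i for i, bit in enumerate(resync_bitmap) if bit]

print(resync_bitmap)    # [0, 1, 0, 1, 1, 0, 0, 0]
print(blocks_to_copy)   # [1, 3, 4] – only these blocks are copied during resync/restore
```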

SLIDE 18

Contents

2. Introduction to Erasure Codes

SLIDE 19

Erasure Coding Basics (1)

  • You've got some data
  • And a collection of storage nodes
  • And you want to store the data on the storage nodes so that you can get the data back, even when the nodes fail

SLIDE 20

Erasure Coding Basics (2)

  • More concretely: you have k disks worth of data
  • And n total disks
  • The erasure code tells you how to create n disks worth of data+coding so that when disks fail, you can still get the data

SLIDE 21

Erasure Coding Basics (3)

  • You have k disks worth of data
  • And n total disks
  • n = k + m
  • A systematic erasure code stores the data in the clear on k of the n disks: there are k data disks and m coding or "parity" disks → Horizontal Code
SLIDE 22

Erasure Coding Basics (4)

  • You have k disks worth of data
  • And n total disks
  • n = k + m
  • A non-systematic erasure code stores only coding information, but we still use k, m, and n to describe the code → Vertical Code

SLIDE 23

Erasure Coding Basics (5)

  • You have k disks worth of data
  • And n total disks
  • n = k + m
  • When disks fail, their contents become unusable and the storage system detects this. This failure mode is called an erasure.

SLIDE 24

Erasure Coding Basics (6)

  • You have k disks worth of data
  • And n total disks
  • n = k + m
  • An MDS ("Maximum Distance Separable") code can reconstruct the data from any m failures → Optimal
  • A code that can only reconstruct any f failures (f < m) → non-MDS code
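  • Example: with k = 4 data disks and m = 2 coding disks (n = 6), an MDS code can rebuild the data after any 2 of the 6 disks fail; a non-MDS code with the same m survives only some 2-disk failure combinations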
SLIDE 25

Two Views of a Stripe (1)

  • The Theoretical View:
– The minimum collection of bits that encode and decode together
– r rows of w-bit symbols from each of n disks

SLIDE 26

Two Views of a Stripe (2)

  • The Systems View:
– The minimum partition of the system that encodes and decodes together
– Groups together theoretical stripes for performance

SLIDE 27

Horizontal & Vertical Codes

  • Horizontal Code
  • Vertical Code
SLIDE 28

Expressing Code with Generator Matrix (1)

SLIDE 29

Expressing Code with Generator Matrix (2)

SLIDE 30

Expressing Code with Generator Matrix (3)

SLIDE 31

Encoding— Linux RAID-6 (1)

SLIDE 32

Encoding— Linux RAID-6 (2)

SLIDE 33

Encoding— Linux RAID-6 (3)
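The encoding slides are figure-based; as a minimal sketch of the standard Linux RAID-6 scheme (P is the XOR of the data blocks, Q a weighted sum over GF(2^8) with generator 2 and reduction polynomial 0x11d), here is a simplified per-byte version in Python. It is not the kernel's optimized SIMD implementation:

```python
def gf_mul2(x):
    """Multiply by 2 in GF(2^8) with polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
    return x & 0xff

def raid6_pq(data):
    """data: one byte per data disk at the same stripe offset.
    P = d0 ^ d1 ^ ... ^ d(k-1)
    Q = 2^0*d0 ^ 2^1*d1 ^ ... ^ 2^(k-1)*d(k-1), evaluated Horner-style."""
    p, q = 0, 0
    for d in reversed(data):      # Horner: q = 2*q ^ d, highest index first
        p ^= d
        q = gf_mul2(q) ^ d
    return p, q

print(raid6_pq([0x11, 0x22, 0x33, 0x44]))
```

The Horner loop is why multiplying a whole buffer by 2 being cheap matters so much for RAID-6 throughput.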

SLIDE 34

Accelerate Encoding— Linux RAID-6

SLIDE 35

Encoding— RDP (1)

SLIDE 36

Encoding— RDP (2)

SLIDE 37

Encoding— RDP (3)

SLIDE 38

Encoding— RDP (4)

SLIDE 39

Encoding— RDP (5)

SLIDE 40

Encoding— RDP (6)

  • Horizontal parity layout (p=7, n=8)

[Figure: RDP stripe with 6 data columns, one horizontal parity column, and one diagonal parity column]

SLIDE 41

Encoding— RDP (7)

  • Diagonal parity layout (p=7, n=8)

[Figure: RDP stripe with 6 data columns, one horizontal parity column, and one diagonal parity column]
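A rough sketch of RDP-style encoding for the layout above, assuming p = 7 (6 data disks, 6 rows per stripe, one row-parity disk, one diagonal-parity disk) and XOR on integer symbols. It follows the usual published RDP construction rather than any specific implementation, so the exact element placement on the slides may differ:

```python
def rdp_encode(data, p=7):
    """data: (p-1) rows x (p-1) data columns of integer symbols."""
    rows, cols = p - 1, p - 1

    # Row (horizontal) parity: XOR of the data elements in each row.
    row_parity = [0] * rows
    for r in range(rows):
        for c in range(cols):
            row_parity[r] ^= data[r][c]

    # Diagonals cover the data columns plus the row-parity column (column p-1).
    # Block (r, c) lies on diagonal (r + c) mod p; diagonal p-1 is the
    # "missing" diagonal and is never stored.
    diag_parity = [0] * rows
    for r in range(rows):
        for c in range(p):
            d = (r + c) % p
            if d == p - 1:
                continue
            diag_parity[d] ^= data[r][c] if c < cols else row_parity[r]
    return row_parity, diag_parity


stripe = [[r * 10 + c for c in range(6)] for r in range(6)]
print(rdp_encode(stripe))
```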

SLIDE 42

Arithmetic for Erasure Codes

  • When w = 1: XORs only
  • Otherwise, Galois Field arithmetic GF(2^w)
– w is 2, 4, 8, 16, 32, 64, or 128 so that symbols fit evenly into computer words
– Addition is equal to XOR (nice, because addition equals subtraction)
– Multiplication is more complicated:
– It gets more expensive as w grows
– Multiplying a whole buffer by a constant is a different operation from a single a * b
– Buffer * 2 can be done really fast
– Open-source library support exists
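A small illustration of GF(2^8) multiplication: carry-less shift-and-add with reduction by the polynomial 0x11d (the one used by Linux RAID-6 and many coding libraries). Real libraries such as Jerasure use table-driven versions of this; the sketch below trades speed for clarity:

```python
def gf_mul(a, b, poly=0x11d):
    """Multiply a and b in GF(2^8): shift-and-add, reducing modulo `poly`."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # "addition" is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:            # keep a within 8 bits
            a ^= poly
    return result

assert gf_mul(3, 7) == 0x09        # (x+1)(x^2+x+1) = x^3+1, no reduction needed
assert gf_mul(2, 0x8e) == 0x01     # 0x8e is the multiplicative inverse of 2 under 0x11d

# Multiplying a whole buffer by a constant (the common bulk operation):
buf = bytes([1, 2, 3, 4])
scaled = bytes(gf_mul(5, x) for x in buf)
```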

SLIDE 43

Decoding with Generator Matrices (1)

SLIDE 44

Decoding with Generator Matrices (2)

SLIDE 45

Decoding with Generator Matrices (3)

SLIDE 46

Decoding with Generator Matrices (4)

SLIDE 47

Decoding with Generator Matrices (5)

SLIDE 48

Erasure Codes — Reed Solomon (1)

  • Introduced in 1960
  • MDS erasure codes for any n and k
– That means any m = (n - k) failures can be tolerated without data loss
  • r = 1 (Theoretical): one word per disk per stripe
  • w constrained so that n ≤ 2^w
  • Systematic and non-systematic forms
SLIDE 49

Erasure Codes — Reed Solomon (2) Systematic RS: Cauchy generator matrix

SLIDE 50

Erasure Codes — Reed Solomon (3) Non-Systematic RS: Vandermonde generator matrix

SLIDE 51

Erasure Codes — Reed Solomon (4) Non-Systematic RS: Vandermonde generator matrix
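As a hedged sketch of the non-systematic Vandermonde idea (not the exact matrices on the slides): generator row i is (1, x_i, x_i^2, ...) for a distinct point x_i, so each codeword symbol is the data polynomial evaluated at x_i, and any k surviving symbols determine the data. The example below uses k = 2 for brevity, GF(2^8) with polynomial 0x11d, and repeats the gf_mul helper from the arithmetic sketch so it is self-contained:

```python
def gf_mul(a, b, poly=0x11d):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return r

def gf_inv(a):
    # Brute-force inverse in GF(2^8); fine for a sketch.
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

# Non-systematic Vandermonde encoding with k = 2:
# generator row i is (1, x_i), so codeword symbol c_i = d0 ^ x_i * d1.
d0, d1 = 0x12, 0x34
xs = [1, 2, 3, 4]                                   # n = 4 distinct evaluation points
code = [d0 ^ gf_mul(x, d1) for x in xs]             # one symbol per disk

# Decoding from ANY 2 surviving disks a and b (a != b):
# solve d0 ^ x_a*d1 = c_a and d0 ^ x_b*d1 = c_b, remembering addition = XOR.
a, b = 0, 3
rec_d1 = gf_mul(code[a] ^ code[b], gf_inv(xs[a] ^ xs[b]))
rec_d0 = code[a] ^ gf_mul(xs[a], rec_d1)
assert (rec_d0, rec_d1) == (d0, d1)
```

Any k×k submatrix of a Vandermonde matrix with distinct points is invertible, which is exactly why any m = n - k losses can be tolerated.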

SLIDE 52

Erasure Codes — EVENODD 1995 (7 disks, tolerating 2 disk failures)

  • Horizontal Parity Coding
  • Calculated from the data elements in the same row
  • E.g. 𝐷0,5 = 𝐷0,0 ⊕ 𝐷0,1 ⊕ 𝐷0,2 ⊕ 𝐷0,3 ⊕ 𝐷0,4
  • Diagonal Parity Coding
  • Calculated from the data elements and S
  • E.g. 𝐷0,6 = 𝐷0,0 ⊕ 𝐷3,2 ⊕ 𝐷2,3 ⊕ 𝐷1,4 ⊕ S

SLIDE 53

Erasure Codes — X-Code 1999 (1)

  • Diagonal parity layout (p=7, n=7)

[Figure: X-Code layout showing data, diagonal parity, and anti-diagonal parity elements]

SLIDE 54

Erasure Codes — X-Code 1999 (2)

  • Anti-diagonal parity layout (p=7, n=7)

[Figure: X-Code layout showing data, diagonal parity, and anti-diagonal parity elements]

SLIDE 55

Erasure Codes — H-Code (1)

  • Horizontal parity layout (p=7, n=8)

[Figure: H-Code layout showing data, horizontal parity, and anti-diagonal parity elements]

SLIDE 56

Erasure Codes — H-Code (2)

  • Anti-diagonal parity layout (p=7, n=8)

[Figure: H-Code layout showing data, horizontal parity, and anti-diagonal parity elements]

SLIDE 57

Erasure Codes — H-Code (3)

  • Recover double disk failure by a single recovery chain

[Figure: two failed disks; the lost data and parity elements are rebuilt step by step (1–12) along a single recovery chain through horizontal and anti-diagonal parity]

SLIDE 58

Erasure Codes — H-Code (4)

  • Recover double disk failure by two recovery chains

[Figure: two failed disks; the lost data and parity elements are rebuilt along two independent recovery chains through horizontal and anti-diagonal parity]

SLIDE 59

Erasure Codes — HDP Code (1)

  • Diagonal parity layout (p=7, n=6)

[Figure: HDP stripe showing data, HDP, and ADP (anti-diagonal parity) elements]

SLIDE 60

Erasure Codes — HDP Code (2)

  • Diagonal parity layout (p=7, n=6)

[Figure: HDP stripe showing data, HDP, and ADP (anti-diagonal parity) elements]

SLIDE 61

Erasure Codes — HDP Code (3)

  • HDP reduces average recovery time by more than 30%

[Figure: HDP stripe with lost data and parity elements recovered through HDP and ADP chains]

SLIDE 62

Contents

3. Replication and EC in Cloud

SLIDE 63

Three Dimensions in Cloud Storage

SLIDE 64

Replication vs Erasure Coding (RS)
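The comparison figures are not reproduced here; as a back-of-the-envelope illustration, assume 3-way replication versus a Reed-Solomon code with k = 6 data and m = 3 parity fragments (a commonly cited cloud configuration, used here only as an assumption). The storage overhead and failure tolerance then work out as follows:

```python
# 3-way replication: every byte is stored 3 times
replication_copies = 3
replication_overhead = replication_copies          # 3.0x raw storage per byte
replication_tolerates = replication_copies - 1     # survives loss of any 2 copies

# Reed-Solomon (k = 6, m = 3): 9 fragments, any 6 suffice to rebuild
k, m = 6, 3
rs_overhead = (k + m) / k                           # 1.5x raw storage per byte
rs_tolerates = m                                    # survives loss of any 3 fragments

print(f"replication: {replication_overhead:.1f}x storage, tolerates {replication_tolerates} failures")
print(f"RS({k},{m}):     {rs_overhead:.1f}x storage, tolerates {rs_tolerates} failures")
# The price: rebuilding one lost RS fragment reads k = 6 other fragments over the
# network, whereas replication reads a single surviving copy.
```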

SLIDE 65

Fundamental Tradeoff

SLIDE 66

Pyramid Codes (1)

SLIDE 67

Pyramid Codes (2)

SLIDE 68

Pyramid Codes (3) Multiple Hierarchies

SLIDE 69

Pyramid Codes (4) Multiple Hierarchies

SLIDE 70

Pyramid Codes (5) Multiple Hierarchies

SLIDE 71

Pyramid Codes (6)

SLIDE 72

Google GFS II – Based on RS

SLIDE 73

Microsoft Azure (1) How to Reduce Cost?

SLIDE 74

Microsoft Azure (2) Recovery becomes expensive

SLIDE 75

Microsoft Azure (3) Best of both worlds?

SLIDE 76

Microsoft Azure (4) Local Reconstruction Code (LRC)
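A minimal sketch of the local-reconstruction idea behind LRC: the data blocks are split into local groups, each with its own parity, so a single lost block is rebuilt from its group alone rather than from all k blocks. The group sizes below and the use of plain XOR (instead of the actual Azure LRC coefficients and global parities) are illustrative assumptions only:

```python
from functools import reduce

def xor_all(blocks):
    return reduce(lambda a, b: a ^ b, blocks)

# 12 data blocks split into 2 local groups of 6, one XOR parity per group.
data = list(range(1, 13))
groups = [data[0:6], data[6:12]]
local_parity = [xor_all(g) for g in groups]
# (A full LRC would add global parities on top of the local ones.)

# Reconstruct a single failed block from its local group only:
failed_index = 8                       # this block lives in group 1
g = 1
survivors = [blk for i, blk in enumerate(groups[g]) if i != failed_index - 6]
rebuilt = xor_all(survivors) ^ local_parity[g]
assert rebuilt == data[failed_index]
# Only 6 blocks were read; plain Reed-Solomon over all 12 data blocks would read 12.
```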

SLIDE 77

Microsoft Azure (5) Analysis LRC vs RS

SLIDE 78

Microsoft Azure (6) Analysis LRC vs RS
SLIDE 79

Recovery Problem in the Cloud

  • Recovering a single failure requires I/Os from 6 disks (high network bandwidth)
SLIDE 80

Optimizing Recovery Network I/O (1)

SLIDE 81

Optimizing Recovery Network I/O (2)

  • Establish recovery relationships among disks

SLIDE 82

Optimizing Recovery Network I/O (3)

  • Roughly 20% or more savings in general
SLIDE 83

Regenerating Codes (1)

  • Data = {a, b, c}

SLIDE 84

Regenerating Codes (2)

  • Optimal Repair

SLIDE 85

Regenerating Codes (3)

  • Optimal Repair

SLIDE 86

Regenerating Codes (4)

  • Optimal Repair

SLIDE 87

Regenerating Codes (5) Analysis: Regenerating vs RS

SLIDE 88

Facebook Xorbas Hadoop Locally Repairable Codes

SLIDE 89

Combination of Two ECs (1) Recovery Cost vs. Storage Overhead

SLIDE 90

Combination of Two ECs (2) Fast Code and Compact Code

SLIDE 91

Combination of Two ECs (3) Analysis

SLIDE 92

Combination of Two ECs (4) Analysis

SLIDE 93

Combination of Two ECs (5) Analysis

SLIDE 94

Combination of Two ECs (6) Conversion

  • Horizontal parities require no re-computation
  • Vertical parities require no data block transfer
  • All parity updates can be done in parallel and in a distributed manner

SLIDE 95

Combination of Two ECs (7) Results

SLIDE 96

Contents

4. Project 1

SLIDE 97

Erasure Code in Hadoop (1)

  • Implement an erasure code in the Hadoop system
  • Hadoop Version: 2.7 or higher
  • Erasure Code: you can select one, but not RS
  • Test the storage efficiency of your proposed code
  • Report and Source Code are required
  • Source Code should be checked by TA
  • Deadline: June 30th
SLIDE 98

Erasure Code in Hadoop (2)

  • References
  • Jerasure

http://web.eecs.utk.edu/~plank/plank/www/software.html

  • HDFS-Xorbas

http://smahesh.com/HadoopUSC/

SLIDE 99

Thank you!