SLIDE 1 Big Data Processing Technologies
Chentao Wu Associate Professor
- Dept. of Computer Science and Engineering
wuct@cs.sjtu.edu.cn
SLIDE 2 Schedule
- lec1: Introduction on big data and cloud
computing
- lec2: Introduction on data storage
- lec3: Data reliability (Replication/Archive/EC)
- lec4: Data consistency problem
- lec5: Block level storage and file storage
- lec6: Object-based storage
- lec7: Distributed file system
- lec8: Metadata management
SLIDE 3
Collaborators
SLIDE 4 Data Reliability Problem (1) Google – Disk Annual Failure Rate
SLIDE 5 Data Reliability Problem (2) Facebook – Failed nodes in a 3000-node cluster
SLIDE 6 Contents
Introduction on Replication
1
SLIDE 7 What is Replication?
- Replication is the process of creating an exact copy (replica) of data
- Replication can be classified as
- Local replication
- Replicating data within the same array or data center
- Remote replication
- Replicating data at a remote site
(Figure: Source volume replicated to Replica/Target)
SLIDE 8 File System Consistency: Flushing Host Buffer
(Figure: Application memory buffers are flushed through the file system, logical volume manager, and physical disk driver to disk, so that source and replica are consistent)
SLIDE 9 Database Consistency: Dependent Write I/O Principle
(Figure: Dependent writes 1-4 applied in order on source and replica; a replica that captures the writes only partially or out of order is inconsistent, while one that preserves the write order is consistent)
SLIDE 10 Host-based Replication: LVM-based Mirroring
(Figure: Host logical volume mirrored across Physical Volume 1 and Physical Volume 2)
- LVM: Logical Volume Manager
SLIDE 11 Host-based Replication: File System Snapshot
- Uses the Copy on First Write (CoFW) principle
- A block-level bitmap tracks which blocks have been copied
- The snapshot consumes only a fraction of the space used by the production FS
(Figure: Production FS and FS Snapshot metadata; unchanged snapshot blocks point to production data, while the original data of changed blocks is copied into the snapshot)
SLIDE 12 Storage Array-based Local Replication
- Replication performed by the array operating
environment
- Source and replica are on the same array
- Types of array-based replication
- Full-volume mirroring
- Pointer-based full-volume replication
- Pointer-based virtual replication
(Figure: Production Host and BC Host attached to a storage array holding Source and Replica)
SLIDE 13 Full-Volume Mirroring
- Attached: the target is synchronized with the source; the source is Read/Write while the target is Not Ready
- Detached (Point-In-Time): the mirror is split from the source; both source and target become Read/Write
(Figure: Production Host and BC Host with Source and Target on the storage array, shown in the attached and detached states)
SLIDE 14 Copy on First Access: Write to the Source
- When a write is issued to the source for the first time after replication session activation:
- The original data at that address is copied to the target
- Then the new data is written to the source
- This ensures that the original data at the point-in-time of activation is preserved on the target
(Figure: Production Host writes C' to the source; the original C is first copied to the target)
SLIDE 15 Copy on First Access: Write to the Target
- When a write is issued to the target for the first time after replication session activation:
- The original data is copied from the source to the target
- Then the new data is written to the target
(Figure: BC Host writes B' to the target; the original B is first copied from the source)
SLIDE 16 Copy on First Access: Read from Target
- When a read is issued to the target for the first time after replication session activation:
- The original data is copied from the source to the target and is made available to the BC host
(Figure: BC Host reads data "A", which is first copied from the source to the target)
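The copy-on-first-access behavior on slides 14-16 can be sketched in Python. This is a minimal illustration; the class, method names, and block values are made up for the example, not taken from any real replication product:

```python
# Sketch of Copy on First Access (CoFA): source and target volumes are
# modeled as lists of blocks; the original data at the point in time of
# activation is copied to the target exactly once, on first touch.

class CoFASession:
    """Pointer-based replica: the target holds data only for touched blocks."""

    def __init__(self, source):
        self.source = source                    # production volume
        self.target = [None] * len(source)      # replica, initially empty
        self.copied = [False] * len(source)     # per-block "already copied" flag

    def _copy_original(self, addr):
        # Preserve the point-in-time original on the target, once per block.
        if not self.copied[addr]:
            self.target[addr] = self.source[addr]
            self.copied[addr] = True

    def write_source(self, addr, data):
        self._copy_original(addr)               # copy original to target first
        self.source[addr] = data                # then apply the new write

    def write_target(self, addr, data):
        self._copy_original(addr)               # pull original from source first
        self.target[addr] = data                # then apply the write to the target

    def read_target(self, addr):
        self._copy_original(addr)               # first read also triggers the copy
        return self.target[addr]

s = CoFASession(["A", "B", "C"])
s.write_source(2, "C'")                         # source now holds C'; target keeps C
```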
SLIDE 17 Tracking Changes to Source and Target
- Bitmaps track which blocks on the source and on the target have changed after the point in time (0 = unchanged, 1 = changed)
- The logical OR of the two bitmaps identifies the blocks that must be copied for resynchronization/restore
(Figure: source and target bitmaps at and after the PIT, combined with a logical OR)
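The bitmap combination can be shown in a few lines. The bit values below are illustrative (one bit per block; 0 = unchanged since the point in time, 1 = changed):

```python
# Per-volume change-tracking bitmaps after the point in time (PIT).
source_changed = [0, 1, 0, 0, 1, 0, 1, 0]   # blocks written on the source after PIT
target_changed = [0, 0, 1, 0, 0, 0, 1, 0]   # blocks written on the target after PIT

# The logical OR gives every block that differs between source and target,
# i.e. exactly the blocks to copy during resynchronization or restore.
to_resync = [s | t for s, t in zip(source_changed, target_changed)]
print(to_resync)  # [0, 1, 1, 0, 1, 0, 1, 0]
```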
SLIDE 18 Contents
Introduction to Erasure Codes
2
SLIDE 19 Erasure Coding Basis (1)
- You've got some data
- And a collection of storage
nodes.
- And you want to store the data on the storage nodes so that
you can get the data back, even when the nodes fail.
SLIDE 20 Erasure Coding Basis (2)
- More concretely: you have k
disks worth of data
- And n total disks.
- The erasure code tells you how to create n disks worth of
data+coding so that when disks fail, you can still get the data
SLIDE 21 Erasure Coding Basis (3)
- You have k disks worth of
data
- And n total disks.
- n = k + m
- A systematic erasure code stores the data in the clear on k of
the n disks. There are k data disks, and m coding or “parity” disks.
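The simplest systematic code is a single XOR parity disk. A minimal sketch with k = 4 and m = 1 (so n = 5); the byte values stand in for whole disks' worth of data:

```python
# Systematic erasure code, k = 4 data disks + m = 1 XOR parity disk (n = 5).
from functools import reduce

k_disks = [0x1A, 0x2B, 0x3C, 0x4D]            # data, stored in the clear
parity = reduce(lambda a, b: a ^ b, k_disks)  # the single parity disk

# Erase one data disk: the XOR of the survivors and the parity recovers it.
lost = 2
survivors = [d for i, d in enumerate(k_disks) if i != lost]
recovered = reduce(lambda a, b: a ^ b, survivors + [parity])
assert recovered == k_disks[lost]
```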
SLIDE 22 Erasure Coding Basis (4)
- You have k disks worth of
data
- And n total disks.
- n = k + m
- A non-systematic erasure code stores only coding information,
but we still use k, m, and n to describe the code.
SLIDE 23 Erasure Coding Basis (5)
- You have k disks worth of
data
- And n total disks.
- n = k + m
- When disks fail, their contents become unusable, and
the storage system detects this. This failure mode is called an erasure.
SLIDE 24 Erasure Coding Basis (6)
- You have k disks worth of
data
- And n total disks.
- n = k + m
- An MDS (“Maximum Distance Separable”) code can reconstruct
the data from any m failures, which is optimal.
- A non-MDS code can only reconstruct from f failures (f < m).
SLIDE 25 Two Views of a Stripe (1)
- Theoretical stripe: the minimum collection of bits that encode and decode together
- r rows of w-bit symbols from each of n disks
SLIDE 26 Two Views of a Stripe (2)
- Actual stripe: the minimum partition of the system that encodes and decodes together
- Groups together theoretical stripes for performance
SLIDE 27 Horizontal & Vertical Codes
- Horizontal Code
- Vertical Code
SLIDE 28
Expressing Code with Generator Matrix (1)
SLIDE 29
Expressing Code with Generator Matrix (2)
SLIDE 30
Expressing Code with Generator Matrix (3)
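As a concrete illustration of expressing a code with a generator matrix, here is a toy systematic code over GF(2) with k = 3 and m = 2. The matrix entries are invented for the example, not taken from the slides:

```python
# Encoding as a matrix-vector product over GF(2):
# addition is XOR, multiplication is AND.
k, m = 3, 2
G = [                       # n x k generator matrix, identity on top (systematic)
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],              # parity symbol: d0 ^ d1
    [1, 1, 1],              # parity symbol: d0 ^ d1 ^ d2
]
data = [1, 0, 1]

codeword = [0] * (k + m)    # codeword = G * data
for i, row in enumerate(G):
    for g, d in zip(row, data):
        codeword[i] ^= g & d

assert codeword[:k] == data  # systematic: data appears in the clear
```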
SLIDE 31
Encoding— Linux RAID-6 (1)
SLIDE 32
Encoding— Linux RAID-6 (2)
SLIDE 33
Encoding— Linux RAID-6 (3)
SLIDE 34
Accelerate Encoding— Linux RAID-6
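A minimal sketch of the Linux RAID-6 P/Q encoding, assuming the standard scheme: P is the XOR of the data symbols and Q = Σ 2^i · D_i over GF(2^8) with the polynomial 0x11D. The byte values are illustrative; the fast multiply-by-2 step is also why the "accelerate encoding" trick works:

```python
def gf_mul2(b):
    """Multiply a byte by 2 in GF(2^8): shift, then reduce by 0x11D on overflow."""
    b <<= 1
    return (b ^ 0x11D) & 0xFF if b & 0x100 else b

def raid6_encode(data):
    """Return (P, Q) for one byte per data disk."""
    p = q = 0
    for d in reversed(data):    # Horner's rule: q = ((d_last * 2 + ...) * 2) + d_0
        p ^= d
        q = gf_mul2(q) ^ d
    return p, q

p, q = raid6_encode([0x11, 0x22, 0x33, 0x44])
```
Because Q only ever multiplies by the constant 2, the whole syndrome can be computed with shifts and XORs, which vectorizes well.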
SLIDE 35
Encoding— RDP (1)
SLIDE 36
Encoding— RDP (2)
SLIDE 37
Encoding— RDP (3)
SLIDE 38
Encoding— RDP (4)
SLIDE 39
Encoding— RDP (5)
SLIDE 40 Encoding— RDP (6)
- Horizontal parity layout (p=7, n=8)
SLIDE 41 Encoding— RDP (7)
- Diagonal parity layout (p=7, n=8)
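The RDP layouts on slides 40-41 can be sketched as follows. This is a minimal illustration assuming the standard RDP construction (p - 1 rows, p - 1 data disks, one row-parity disk, and one diagonal-parity disk whose diagonals also cover the row parity); the data values are made up:

```python
# RDP encoding sketch for prime p = 7 (n = p + 1 = 8 disks).
p = 7
rows, data_disks = p - 1, p - 1
data = [[(r * 31 + c * 17) % 256 for c in range(data_disks)] for r in range(rows)]

# Horizontal (row) parity disk: XOR across each row of data.
row_parity = [0] * rows
for r in range(rows):
    for c in range(data_disks):
        row_parity[r] ^= data[r][c]

# Diagonal parity disk: diagonal d collects blocks with (r + c) % p == d,
# taken over the data disks AND the row-parity disk (column p - 1).
# Diagonal p - 1 is not stored, which is what enables double-failure recovery.
diag_parity = [0] * rows
for r in range(rows):
    for c in range(data_disks + 1):              # include the row-parity column
        d = (r + c) % p
        if d != p - 1:
            block = row_parity[r] if c == data_disks else data[r][c]
            diag_parity[d] ^= block
```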
SLIDE 42 Arithmetic for Erasure Codes
- When w = 1: XORs only
- Otherwise, Galois Field arithmetic GF(2^w)
- w is 2, 4, 8, 16, 32, 64, or 128, so that symbols fit evenly into computer words
- Addition is equal to XOR
- Nice because addition equals subtraction
- Multiplication is more complicated:
- It gets more expensive as w grows
- Multiplying a whole buffer by a constant is a different operation from a single a * b
- Buffer * 2 can be done really fast
- Open source libraries provide support
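A minimal sketch of GF(2^w) multiplication for w = 8, assuming the 0x11D polynomial used by Linux RAID-6 (other applications choose other polynomials). Production libraries multiply whole buffers by a constant using lookup tables; this shows only the bitwise form:

```python
def gf_mul(a, b, poly=0x11D, w=8):
    """Carry-less 'Russian peasant' multiply in GF(2^w), reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition (and subtraction) in GF(2^w) is XOR
        b >>= 1
        a <<= 1
        if a & (1 << w):         # reduce whenever a overflows w bits
            a ^= poly
    return result & ((1 << w) - 1)

# Multiply-by-2 is a single shift-and-reduce step, hence "really fast".
assert gf_mul(0x80, 2) == 0x1D
```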
SLIDE 43
Decoding with Generator Matrices (1)
SLIDE 44
Decoding with Generator Matrices (2)
SLIDE 45
Decoding with Generator Matrices (3)
SLIDE 46
Decoding with Generator Matrices (4)
SLIDE 47
Decoding with Generator Matrices (5)
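Decoding with a generator matrix can be sketched the same way: drop the rows of G for the failed disks, then solve the remaining k x k system over GF(2). The matrix and failure pattern below are illustrative (k = 3, m = 2), not taken from the slides:

```python
# Decode by inverting the surviving rows of the generator matrix over GF(2).
k = 3
G = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
data = [1, 0, 1]
codeword = [sum(g & d for g, d in zip(row, data)) % 2 for row in G]

# Suppose disks 0 and 3 fail; keep the rows/symbols of survivors 1, 2, 4.
survivors = [1, 2, 4]
A = [G[i][:] for i in survivors]          # k x k matrix of surviving rows
y = [codeword[i] for i in survivors]      # surviving symbols

# Gauss-Jordan elimination over GF(2) on the augmented system [A | y].
for col in range(k):
    pivot = next(r for r in range(col, k) if A[r][col])
    A[col], A[pivot] = A[pivot], A[col]
    y[col], y[pivot] = y[pivot], y[col]
    for r in range(k):
        if r != col and A[r][col]:
            A[r] = [a ^ b for a, b in zip(A[r], A[col])]
            y[r] ^= y[col]

assert y == data                           # the recovered data symbols
```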
SLIDE 48 Erasure Codes — Reed Solomon (1)
- Introduced in 1960
- MDS erasure codes for any n and k
- That means any m = (n - k) failures can be tolerated without data loss
- Theoretical stripe: one word per disk per stripe
- w constrained so that n ≤ 2^w
- Systematic and non-systematic forms
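The MDS property of Reed-Solomon can be illustrated with a deliberately simplified sketch: instead of GF(2^w), this uses the small prime field GF(17), with k = 2 and m = 2 (n = 4). The two data words are the coefficients of a line y = a + b·x; storing its evaluations at 4 points means any 2 survivors determine the line:

```python
# Toy non-systematic Reed-Solomon over GF(17) (real codes use GF(2^w)).
P = 17
a, b = 5, 11                                     # the k = 2 data words
codeword = [(a + b * x) % P for x in range(4)]   # n = 4 stored words, one per disk

def recover(x1, y1, x2, y2):
    """Interpolate the line back from any two surviving (x, y) pairs."""
    slope = (y1 - y2) * pow(x1 - x2, -1, P) % P  # modular inverse (Python 3.8+)
    return (y1 - slope * x1) % P, slope

# Lose ANY two of the four words: the survivors still determine (a, b).
for i in range(4):
    for j in range(i + 1, 4):
        assert recover(i, codeword[i], j, codeword[j]) == (a, b)
```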
SLIDE 49
Erasure Codes —Reed Solomon (2) Systematic RS -- Cauchy generator matrix
SLIDE 50 Erasure Codes —Reed Solomon (3) Non-Systematic RS -- Vandermonde generator matrix
SLIDE 51 Erasure Codes —Reed Solomon (4) Non-Systematic RS -- Vandermonde generator matrix
SLIDE 52 Erasure Codes —EVENODD 1995 (7 disks, tolerating 2 disk failures)
- Horizontal parity coding
- Calculated from the data elements in the same row
- E.g. D_{0,5} = D_{0,0} ⊕ D_{0,1} ⊕ D_{0,2} ⊕ D_{0,3} ⊕ D_{0,4}
- Diagonal parity coding
- Calculated from the data elements and the adjuster S
- E.g. D_{0,6} = D_{0,0} ⊕ D_{3,2} ⊕ D_{2,3} ⊕ D_{1,4} ⊕ S
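The two parity equations above can be checked with a small sketch. This assumes the usual EVENODD layout for p = 5 (four rows, five data disks, with columns 5 and 6 holding horizontal and diagonal parity); the data bits are toy values:

```python
# EVENODD encoding sketch for p = 5 (7 disks, tolerating 2 failures).
p = 5
rows = p - 1
D = [[(3 * r + 7 * c) % 2 for c in range(p)] for r in range(rows)]  # toy data bits

hp = [0] * rows                      # horizontal parity disk (column 5)
for r in range(rows):
    for c in range(p):
        hp[r] ^= D[r][c]

S = 0                                # adjuster: XOR of the unstored diagonal r + c = p - 1
for r in range(rows):
    S ^= D[r][(p - 1 - r) % p]

dp = [S] * rows                      # diagonal parity disk (column 6)
for r in range(rows):
    for c in range(p):
        d = (r + c) % p
        if d != p - 1:
            dp[d] ^= D[r][c]         # diagonal d collects cells with (r + c) % p == d
```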
SLIDE 53 Erasure Codes —X-Code 1999 (1)
- Diagonal parity layout (p=7, n=7)
SLIDE 54 Erasure Codes —X-Code 1999 (2)
- Anti-diagonal parity layout (p=7, n=7)
SLIDE 55 Erasure Codes —H-Code (1)
- Horizontal parity layout (p=7, n=8)
SLIDE 56 Erasure Codes —H-Code (2)
- Anti-diagonal parity layout (p=7, n=8)
SLIDE 57 Erasure Codes —H-Code (3)
- Recover double disk failure by single recovery chain
(Figure: a single recovery chain alternates between horizontal and anti-diagonal parities to rebuild the lost data and parity elements one by one)
SLIDE 58 Erasure Codes —H-Code (4)
- Recover double disk failure by two recovery chains
(Figure: two independent recovery chains rebuild the lost data and parity elements in parallel)
SLIDE 59 Erasure Codes —HDP Code (1)
- Horizontal-diagonal parity layout (p=7, n=6)
SLIDE 60 Erasure Codes —HDP Code (2)
- Anti-diagonal parity layout (p=7, n=6)
SLIDE 61 Erasure Codes —HDP Code (3)
- HDP reduces average recovery time by more than 30%
(Figure: recovery of lost data and parity elements via the HDP and ADP chains)
SLIDE 62 Contents
Replication and EC in Cloud
3
SLIDE 63
Three Dimensions in Cloud Storage
SLIDE 64
Replication vs Erasure Coding (RS)
SLIDE 65
Fundamental Tradeoff
SLIDE 66
Pyramid Codes (1)
SLIDE 67
Pyramid Codes (2)
SLIDE 68
Pyramid Codes (3) Multiple Hierarchies
SLIDE 69
Pyramid Codes (4) Multiple Hierarchies
SLIDE 70
Pyramid Codes (5) Multiple Hierarchies
SLIDE 71
Pyramid Codes (6)
SLIDE 72
Google GFS II – Based on RS
SLIDE 73
Microsoft Azure (1) How to Reduce Cost?
SLIDE 74
Microsoft Azure (2) Recovery becomes expensive
SLIDE 75
Microsoft Azure (3) Best of both worlds?
SLIDE 76
Microsoft Azure (4) Local Reconstruction Code (LRC)
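The local-reconstruction idea can be sketched as follows. This toy version uses k = 6 data blocks in two XOR groups and omits the global parities that Azure's real LRC computes over GF(2^w); all names and values are illustrative:

```python
# LRC sketch: each local group has its own XOR parity, so a single lost
# block is rebuilt from its small group rather than from all k blocks.
from functools import reduce

data = [0x10, 0x22, 0x35, 0x47, 0x59, 0x6B]       # k = 6 data blocks
groups = [data[:3], data[3:]]                      # two local groups of 3
local_parity = [reduce(lambda a, b: a ^ b, g) for g in groups]

# Repair block 4 (in group 1): read its 2 group mates + 1 local parity,
# i.e. 3 reads, versus k = 6 reads for a Reed-Solomon-coded stripe.
lost = 4
mates = [d for i, d in enumerate(groups[1], start=3) if i != lost]
repaired = reduce(lambda a, b: a ^ b, mates + [local_parity[1]])
assert repaired == data[lost]
```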
SLIDE 77
Microsoft Azure (5) Analysis LRC vs RS
SLIDE 78
Microsoft Azure (6) Analysis LRC vs RS
SLIDE 79 Recovery problem in Cloud
- Recovery I/Os from 6 disks (high network bandwidth)
SLIDE 80
Optimizing Recovery Network I/O (1)
SLIDE 81 Optimizing Recovery Network I/O (2)
- Establish recovery relationships among disks
SLIDE 82 Optimizing Recovery Network I/O (3)
SLIDE 83 Regenerating Codes (1)
SLIDE 84 Regenerating Codes (2)
SLIDE 85 Regenerating Codes (3)
SLIDE 86 Regenerating Codes (4)
SLIDE 87
Regenerating Codes (5) Analysis -- Regenerating vs RS
SLIDE 88
Facebook Xorbas Hadoop Locally Repairable Codes
SLIDE 89
Combination of Two ECs (1) Recovery Cost vs. Storage Overhead
SLIDE 90
Combination of Two ECs (2) Fast Code and Compact Code
SLIDE 91
Combination of Two ECs (3) Analysis
SLIDE 92
Combination of Two ECs (4) Analysis
SLIDE 93
Combination of Two ECs (5) Analysis
SLIDE 94 Combination of Two ECs (6) Conversion
- Horizontal parities require no re-computation
- Vertical parities require no data block transfer
- All parity updates can be done in parallel and in a distributed
manner
SLIDE 95
Combination of Two ECs (7) Results
SLIDE 96 Contents
Project 1
4
SLIDE 97 Erasure Code in Hadoop (1)
- Implement an erasure code in a Hadoop system
- Hadoop version: 2.7 or higher
- Erasure code: you can select any one, except RS
- Test the storage efficiency of your proposed code
- Report and source code are required
- Source code will be checked by the TA
- Deadline: June 30th
SLIDE 98 Erasure Code in Hadoop (2)
http://web.eecs.utk.edu/~plank/plank/www/software.html
http://smahesh.com/HadoopUSC/
SLIDE 99
Thank you!