15-853: Algorithms in the Real World, Error Correcting Codes (cont.)


  1. 15-853: Algorithms in the Real World, Error Correcting Codes (cont.)
Scribe volunteers: ?
Announcement: Scribe notes template and instructions are on the course webpage.

  2. General Model
Pipeline: message (m) → encoder → codeword (c) → noisy channel → codeword' (c') → decoder → message (or error)
"Noise" introduced by the channel:
• changed fields in the codeword vector (e.g. a flipped bit): called errors
• missing fields in the codeword vector (e.g. a lost byte): called erasures
How does the decoder deal with errors and/or erasures?
• detection (only needed for errors)
• correction

  3. Block Codes
Each message and codeword is of fixed size.
Σ = codeword alphabet, k = |m|, n = |c|, q = |Σ|
C = "code" = set of codewords, C ⊆ Σ^n
Δ(x,y) = number of positions s.t. x_i ≠ y_i
d = min{ Δ(x,y) : x, y ∈ C, x ≠ y }
Code described as: (n, k, d)_q
Pipeline: message (m) → coder → codeword (c) → noisy channel → codeword' (c') → decoder → message (or error)
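To make these definitions concrete, here is a minimal Python sketch (my own, not from the slides) that computes Δ(x,y) and the minimum distance d of a code given as an explicit list of codewords:

```python
from itertools import combinations

def delta(x, y):
    """Delta(x, y): number of positions where x and y differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

def min_distance(C):
    """d = min{ Delta(x, y) : x, y in C, x != y }."""
    return min(delta(x, y) for x, y in combinations(C, 2))

# Example: the 3-bit repetition code {000, 111}, an (n=3, k=1, d=3) code
# over the binary alphabet (q = 2).
C = ["000", "111"]
print(min_distance(C))  # -> 3
```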

  4. Role of Minimum Distance
Theorem: A code C with minimum distance d can:
1. detect any (d-1) errors
2. recover any (d-1) erasures
3. correct any ⌊(d-1)/2⌋ errors
Stated another way:
• For s-bit error detection: d ≥ s + 1
• For s-bit error correction: d ≥ 2s + 1
• To correct a erasures and b errors: d ≥ a + 2b + 1
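As a quick illustration of the theorem, here is a hedged sketch of brute-force nearest-codeword decoding on the repetition code above. Erasures are marked '?' and skipped when comparing, so up to d-1 of them can be recovered; up to ⌊(d-1)/2⌋ errors are corrected by picking the closest codeword:

```python
def decode(received, C):
    """Return the codeword closest to `received`, ignoring erased
    ('?') positions. Brute force: fine for tiny codes only."""
    def dist(x, y):
        return sum(xi != yi for xi, yi in zip(x, y) if xi != "?")
    return min(C, key=lambda c: dist(received, c))

C = ["000", "111"]           # minimum distance d = 3
print(decode("010", C))      # 1 error (= floor((d-1)/2)) -> "000"
print(decode("?1?", C))      # 2 erasures (= d - 1)       -> "111"
```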

  5. Next we will see an application of erasure codes in today's large-scale data storage systems.

  6. Large-scale distributed storage systems
• 1000s of interconnected servers
• 100s of petabytes of data
• Commodity components
• Software issues, power failures, maintenance shutdowns

  7. Large-scale distributed storage systems
• 1000s of interconnected servers
• 100s of petabytes of data
• Commodity components
• Software issues, power failures, maintenance shutdowns
Unavailabilities are the norm rather than the exception.

  8. Facebook analytics cluster in production: unavailability statistics
• Multiple thousands of servers
• Unavailability event: server unresponsive for > 15 min
[Chart: number of unavailability events per day over a 30-day window; median: 52]
[Rashmi, Shah, Gu, Kuang, Borthakur, Ramchandran, USENIX HotStorage 2013 and ACM SIGCOMM 2014]

  9. Facebook analytics cluster in production: unavailability statistics
• Multiple thousands of servers
• Unavailability event: server unresponsive for > 15 min
• Daily server unavailability = 0.5 - 1%
[Chart: number of unavailability events per day over a 30-day window; median: 52]
[Rashmi, Shah, Gu, Kuang, Borthakur, Ramchandran, USENIX HotStorage 2013 and ACM SIGCOMM 2014]

  10. Servers unavailable means data is inaccessible. Applications cannot wait, and data cannot be lost. Data needs to be stored in a redundant fashion.

  11. Traditional approach: Replication
• Storing multiple copies of data: typically 3x-replication
[Diagram: four data "blocks" a, b, c, d; 3 replicas of each block distributed on servers across the network]

  12. Traditional approach: Replication
• Storing multiple copies of data: typically 3x-replication
[Diagram: four data "blocks" a, b, c, d; 3 replicas of each block distributed on servers across the network]
Too expensive for large-scale data. Better alternative: sophisticated codes.

  13. Two data blocks to be stored: a and b. Goal: tolerate any 2 failures.
3-replication: store a, a, a, b, b, b (blocks 1-6). Storage overhead = 3x.
Erasure code: store a, b, a+b, a+2b (blocks 1-4; the last two are "parity blocks"). Storage overhead = 2x.

  14. Two data blocks to be stored: a and b. Goal: tolerate any 2 failures.
3-replication: store a, a, a, b, b, b (blocks 1-6). Storage overhead = 3x.
Erasure code: store a, b, a+b, a+2b (blocks 1-4; the last two are "parity blocks"). Storage overhead = 2x.
Much less storage for the desired fault tolerance.
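A minimal sketch of the erasure code on this slide. Production systems do this arithmetic over a finite field such as GF(2^8); to keep the sketch self-contained, the blocks below are integers modulo a prime p (an illustrative assumption, as are all the names). Any 2 surviving blocks determine a and b:

```python
P = 2_000_003  # prime modulus standing in for finite-field arithmetic

# Coefficients (c1, c2) of each stored block: block = c1*a + c2*b (mod P)
COEFFS = [(1, 0), (0, 1), (1, 1), (1, 2)]   # a, b, a+b, a+2b

def encode(a, b):
    """Produce the 4 stored blocks: 2 data blocks + 2 parity blocks."""
    return [(c1 * a + c2 * b) % P for c1, c2 in COEFFS]

def recover(surviving):
    """Recover (a, b) from any 2 surviving blocks, given as
    {block_index: value}, by solving a 2x2 linear system mod P."""
    (i, x), (j, y) = surviving.items()
    (a1, b1), (a2, b2) = COEFFS[i], COEFFS[j]
    inv = pow((a1 * b2 - a2 * b1) % P, -1, P)   # 1/determinant mod P
    a = (x * b2 - y * b1) * inv % P
    b = (a1 * y - a2 * x) * inv % P
    return a, b

blocks = encode(7, 42)
print(recover({2: blocks[2], 3: blocks[3]}))  # both data blocks lost -> (7, 42)
```

Any pair of rows of COEFFS is linearly independent, which is exactly why the code tolerates any 2 failures.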

  15. Erasure codes: how are they used in distributed storage systems?
Example: 10 data blocks (a through j) and 4 parity blocks (P1-P4), distributed to servers across the network.

  16. Almost all large-scale storage systems today employ erasure codes: Facebook, Google, Amazon, Microsoft...
"Considering trends in data growth & datacenter hardware, we foresee HDFS erasure coding being an important feature in years to come" - Cloudera Engineering (September 2016)

  17. Error Correcting Multibit Messages
We will first discuss Hamming Codes, named after Richard Hamming (1915-1998), a pioneer in error-correcting codes and computing in general.

  18. Error Correcting Multibit Messages
We will first discuss Hamming Codes.
Codes are of the form (2^r - 1, 2^r - 1 - r, 3) for any r > 1,
e.g. (3,1,3), (7,4,3), (15,11,3), (31,26,3), ...
which correspond to 2, 3, 4, 5, ... "parity bits" (i.e. n - k).
Question: error detection and correction capability? (Can detect 2-bit errors, or correct 1-bit errors.)
The high-level idea is to "localize" the error.
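A tiny sketch enumerating the first few members of this family, just to confirm the pattern of parameters:

```python
for r in range(2, 7):
    n = 2**r - 1            # codeword length
    k = n - r               # message length: r parity bits
    print(f"({n},{k},3)")   # -> (3,1,3) (7,4,3) (15,11,3) (31,26,3) (63,57,3)
```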

  19. Hamming Codes: Encoding (r = 4)
Codeword layout (positions 15 down to 0): m15 m14 m13 m12 m11 m10 m9 p8 m7 m6 m5 p4 m3 p2 p1 p0
• Localizing the error to the top or bottom half (1xxx vs 0xxx):
p8 = m15 ⊕ m14 ⊕ m13 ⊕ m12 ⊕ m11 ⊕ m10 ⊕ m9
• Localizing the error to x1xx vs x0xx:
p4 = m15 ⊕ m14 ⊕ m13 ⊕ m12 ⊕ m7 ⊕ m6 ⊕ m5
• Localizing the error to xx1x vs xx0x:
p2 = m15 ⊕ m14 ⊕ m11 ⊕ m10 ⊕ m7 ⊕ m6 ⊕ m3
• Localizing the error to xxx1 vs xxx0:
p1 = m15 ⊕ m13 ⊕ m11 ⊕ m9 ⊕ m7 ⊕ m5 ⊕ m3

  20. Hamming Codes: Decoding
Codeword layout: m15 m14 m13 m12 m11 m10 m9 p8 m7 m6 m5 p4 m3 p2 p1 p0
We don't need p0, so we have a (15,11,?) code.
After transmission, we generate:
b8 = p8 ⊕ m15 ⊕ m14 ⊕ m13 ⊕ m12 ⊕ m11 ⊕ m10 ⊕ m9
b4 = p4 ⊕ m15 ⊕ m14 ⊕ m13 ⊕ m12 ⊕ m7 ⊕ m6 ⊕ m5
b2 = p2 ⊕ m15 ⊕ m14 ⊕ m11 ⊕ m10 ⊕ m7 ⊕ m6 ⊕ m3
b1 = p1 ⊕ m15 ⊕ m13 ⊕ m11 ⊕ m9 ⊕ m7 ⊕ m5 ⊕ m3
With no errors, these will all be zero. With one error, b8 b4 b2 b1 gives us the error location: e.g. 0100 would tell us that p4 is wrong, and 1100 would tell us that m12 is wrong.
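A hedged Python sketch of this encode/decode scheme (names are mine, not from the course; positions are indexed 1..15 so that the syndrome b8 b4 b2 b1, read as a binary number, is literally the position of the flipped bit; p0 is omitted as noted above):

```python
def hamming15_encode(bits):
    """Encode 11 data bits into a 15-bit codeword. word[i] is the bit
    at position i; parity bits sit at positions 1, 2, 4, 8."""
    assert len(bits) == 11
    word, data = [0] * 16, iter(bits)       # index 0 unused
    for pos in range(1, 16):
        if pos not in (1, 2, 4, 8):
            word[pos] = next(data)
    for p in (1, 2, 4, 8):                  # p covers positions with bit p set
        word[p] = sum(word[i] for i in range(1, 16) if i & p and i != p) % 2
    return word[1:]

def hamming15_correct(received):
    """Recompute the checks; the syndrome is 0 (no error) or the
    position of the single flipped bit, which we flip back."""
    word = [0] + list(received)
    syndrome = sum(p for p in (1, 2, 4, 8)
                   if sum(word[i] for i in range(1, 16) if i & p) % 2)
    if syndrome:
        word[syndrome] ^= 1
    return word[1:]

msg = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
c = hamming15_encode(msg)
c[11] ^= 1                                   # inject a single-bit error
assert hamming15_correct(c) == hamming15_encode(msg)
```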

  21. Hamming Codes
Can be generalized to any power of 2:
– n = 2^r - 1 (15 in the example)
– (n - k) = r (4 in the example)
– Can correct one error
– d ≥ 3 (since we can correct one error)
– Gives a (2^r - 1, 2^r - 1 - r, 3) code
(We will later see an easy way to prove the minimum distance.)
Extended Hamming code:
– Add back the parity bit at the end
– Gives a (2^r, 2^r - 1 - r, 4) code
– Can still correct one error, but now can detect 3

  22. A Lower bound on parity bits: Hamming bound
How many nodes in the hypercube do we need so that d = 3?
Each of the 2^k codewords eliminates its n neighbors plus itself, i.e. n + 1 nodes:
2^k (n + 1) ≤ 2^n
k + log2(n + 1) ≤ n
n - k ≥ ⌈log2(n + 1)⌉
In the above Hamming code: 11 + ⌈log2(15 + 1)⌉ = 15.
Hamming Codes are called perfect codes since they match the lower bound exactly.

  23. A Lower bound on parity bits: Hamming bound
What about fixing 2 errors (i.e. d = 5)?
Each of the 2^k codewords eliminates itself, its neighbors, and its neighbors' neighbors, giving:
2^k (1 + n + C(n,2)) ≤ 2^n
Generally, to correct s errors:
n - k ≥ ⌈log2(1 + C(n,1) + C(n,2) + ... + C(n,s))⌉
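A small sketch (function name mine) that evaluates this bound, confirming r = 4 check bits for single-error correction at n = 15 and giving the minimum for two errors:

```python
from math import comb

def min_check_bits(n, s):
    """Hamming bound: n - k >= ceil(log2(sum_{i<=s} C(n, i)))."""
    ball = sum(comb(n, i) for i in range(s + 1))  # size of a radius-s ball
    return (ball - 1).bit_length()                # exact ceil(log2(ball))

print(min_check_bits(15, 1))  # -> 4: matched exactly by the (15,11,3) code
print(min_check_bits(15, 2))  # -> 7: at least 7 check bits needed for d = 5
```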

  24. Lower Bounds: a side note
The lower bounds assume arbitrary placement of bit errors. In practice, errors are likely to have patterns: maybe evenly spaced (x x x x) or clustered (xxxx).
Can we do better if we assume regular errors? We will come back to this later when we talk about Reed-Solomon codes. This is a big reason why Reed-Solomon codes are used much more than Hamming codes.

  25. Q: If there is no structure in the code, how would one perform encoding?
A gigantic lookup table! If there is no structure in the code, encoding is highly inefficient.
A common kind of structure added is linearity.

  26. Linear Codes
If Σ is a field, then Σ^n is a vector space.
Definition: C is a linear code if it is a linear subspace of Σ^n of dimension k.
This means that there is a set of k independent "basis (or spanning) vectors" v_i ∈ Σ^n (1 ≤ i ≤ k) that span the subspace, i.e. every codeword can be written as:
c = a_1 v_1 + a_2 v_2 + ... + a_k v_k, where a_i ∈ Σ

  27. Some Properties of Linear Codes
1. A linear combination of two codewords is a codeword. (Immediate from the definition: a linear subspace is closed under linear combinations.)
2. Minimum distance (d) = weight of the least-weight non-zero codeword. (Proof sketch: for codewords x ≠ y, Δ(x,y) = weight(x - y), and x - y is itself a non-zero codeword by linearity; conversely, the weight of any codeword c is Δ(c, 0), and 0 ∈ C.)

  28. Generator and Parity Check Matrices
3. Every linear code has two matrices associated with it:
1. Generator Matrix: a k x n matrix G such that C = { xG | x ∈ Σ^k }, made by stacking the spanning vectors as rows.
[Diagram: (1 x k) message times (k x n) matrix G = (1 x n) codeword]
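A minimal sketch of encoding with a generator matrix over GF(2), using a systematic generator for the (7,4,3) Hamming code as the example (this particular G is one standard choice, not necessarily the course's). It also checks property 2 from the previous slide, that d equals the minimum weight of a non-zero codeword:

```python
from itertools import product
import numpy as np

# Generator matrix of a (7,4,3) Hamming code in systematic form [I | P]:
# the first 4 columns copy the message, the last 3 are parity bits.
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def encode(x):
    """c = xG over GF(2): ordinary matrix product reduced mod 2."""
    return np.array(x) @ G % 2

print(encode([1, 0, 1, 1]))  # 4-bit message -> 7-bit codeword

# d = minimum weight over all non-zero codewords (property 2, slide 27):
d = min(int(encode(x).sum()) for x in product([0, 1], repeat=4) if any(x))
print(d)  # -> 3
```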
