2/27/2014

ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja

Department of Electrical and Computer Engineering

Low-Level Fault Tolerance: ECC

Overview

  • Introduction
  • Motivation and Background
  • Hamming Codes – by example
  • SEC-DED Codes – Algebraic method


  • SEC-DED Codes – Hardware
  • SEC-DED-SBD Codes
  • Cyclic Codes – (time permitting)
  • Summary

Introduction

  • References
  • Chapter 3 of Koren and Krishna
  • Appendix A of the book [siew:92] – also included in the set of reading material

  • Following references
  • Reddy – “A class of linear codes …” IEEE TC, May 1978

  • Any book on coding theory

Motivation and Background

  • Memories are an integral part of digital systems (computers)
  • The majority of chip and/or board area is taken by memories
  • Hence, reliability improvement methods must pay attention to memories (RAMs, ROMs, etc.)

Motivation and Background (contd.)

  • Types of faults prevalent in memories
  • During manufacturing

– Stuck-at faults
– Timing faults
– Coupling and pattern-sensitive faults

  • During operation

– Cell failures due to aging or stress – same as stuck-at
– Alpha particle hits – cell content changes
  • Sensitive to system location: hit rates are higher at altitude and in flight
– Need non-testing-based solutions
– Random failures – bit/nibble/byte/card failures

Motivation and Background (contd.)

  • Theoretical Foundation

– Linear and modern algebra

  • Concept of groups, fields, and vector spaces
  • We will focus on binary codes but will have to include polynomial algebra
  • Theory – Informal definitions and results

– Vector: a collection of bits represented as a string
– Information bits: a collection of k bits
– Code word: encoded information bit string
  • k information bits are encoded to n bits. The encoded information word is a code word.
– Check bits: r (= n-k) extra bits used to encode information bits


Motivation and Background (contd.)

  • Theory – Informal definitions and results

– Hamming weight (HW) of a vector v: number of 1’s in v
– Hamming distance (HD) between a pair of vectors v1 and v2: number of places in which the two vectors differ: HD(v1, v2) = HW(v1 ⊕ v2)
– Code: collection of code words
– Block code: each code word contains the same number of bits
– Minimum Hamming distance of a code: minimum of all HDs between all pairs of distinct code words in the code
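The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the 4-bit even-parity example code are assumptions made here, not part of the lecture.

```python
from itertools import combinations, product

def hamming_weight(v):
    """Number of 1s in a bit vector given as a tuple of 0/1 values."""
    return sum(v)

def hamming_distance(v1, v2):
    """HD(v1, v2) = HW(v1 XOR v2)."""
    return hamming_weight(tuple(a ^ b for a, b in zip(v1, v2)))

def min_distance(code):
    """Minimum HD over all pairs of distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

# Hypothetical example code: all 4-bit words with even parity.
even_parity_code = [w for w in product((0, 1), repeat=4) if sum(w) % 2 == 0]

print(hamming_distance((1, 0, 1, 1), (0, 0, 1, 0)))  # 2
print(min_distance(even_parity_code))                # 2
```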

Motivation and Background (contd.)

Theory – Informal definitions and results (contd.)

– Error detection: Erroneous word (a code word with one or more bit errors) is not a code word

  • Basic result 1: A code is capable of detecting t errors if and only if the min HD of the code is at least t+1.

– Proof: use a sphere packing argument to show this.

  • Example: use of parity. We know that parity can detect a single error. What is the minimum HD for such a code? Prove that the min HD is 2 using the argument that no two binary strings with even (odd) Hamming weight can have an HD of 1.

Motivation and Background (contd.)

Theory – Informal definitions and results (contd.)

  • Basic result 2: A code is capable of correcting t errors if and only if the min HD of the code is at least 2t+1.

– Proof: use a sphere packing argument as before.

  • Combine the two results: A code is capable of correcting t errors and detecting d errors (d ≥ t) if and only if the min HD of the code is at least t+d+1.
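The three distance results can be turned into a small calculator. This is a minimal sketch; the helper names are hypothetical, and d_min stands for the minimum Hamming distance of the code.

```python
def max_detectable(d_min):
    """Largest t with d_min >= t + 1 (detection only)."""
    return d_min - 1

def max_correctable(d_min):
    """Largest t with d_min >= 2t + 1 (correction only)."""
    return (d_min - 1) // 2

def can_correct_and_detect(d_min, t, d):
    """True if the code can correct t errors and detect d >= t errors."""
    return d >= t and d_min >= t + d + 1

print(max_detectable(2))                 # 1  (single parity bit)
print(max_correctable(3))                # 1  (SEC)
print(can_correct_and_detect(4, 1, 2))   # True  (SEC-DED)
print(can_correct_and_detect(3, 1, 2))   # False
```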

Hamming Codes – by example

  • A linear block code
  • Consider a (7,4) Hamming code
  • Let i1, i2, i3, i4 be information symbols
  • Let p1, p2, p4 be check symbols
  • The parity equations:

p1 = i1 ⊕ i2 ⊕ i4
p2 = i1 ⊕ i3 ⊕ i4
p4 = i2 ⊕ i3 ⊕ i4

Hamming Codes – by example (contd.)

  • Can write the equations as follows (easy to remember)

p1 p2 i1 p4 i2 i3 i4
 1  0  1  0  1  0  1
 0  1  1  0  0  1  1
 0  0  0  1  1  1  1
 1  2  3  4  5  6  7   (bit position)

This encodes a 4-bit information word into a 7-bit codeword
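As an illustration, the three parity equations can be coded directly. The function name `hamming74_encode` is an assumption for this sketch; the bit order p1 p2 i1 p4 i2 i3 i4 follows the slide.

```python
def hamming74_encode(i1, i2, i3, i4):
    """Encode 4 information bits into a 7-bit code word p1 p2 i1 p4 i2 i3 i4."""
    p1 = i1 ^ i2 ^ i4
    p2 = i1 ^ i3 ^ i4
    p4 = i2 ^ i3 ^ i4
    return [p1, p2, i1, p4, i2, i3, i4]

print(hamming74_encode(1, 0, 1, 1))  # [0, 1, 1, 0, 0, 1, 1]
```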

Hamming Codes – by example (contd.)

  • Properties of the code

– If there is no error, all parity equations will be satisfied
– Denote the outcomes of these equation checks as c1, c2, c4
– If there is exactly one error, then c1, c2, c4 point to the error
– The vector (c1, c2, c4) is called the syndrome
– The above (7,4) Hamming code is a SEC (single error correcting) code
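A minimal sketch of how the syndrome points to the error, assuming the bit order p1 p2 i1 p4 i2 i3 i4 (positions 1 to 7) used on the slides. The function names are hypothetical. Reading c4 c2 c1 as a binary number gives the position of a single erroneous bit.

```python
def hamming74_syndrome(word):
    """Return (c1, c2, c4) for a 7-bit word ordered p1 p2 i1 p4 i2 i3 i4."""
    b = [None] + list(word)          # 1-based positions, as on the slide
    c1 = b[1] ^ b[3] ^ b[5] ^ b[7]
    c2 = b[2] ^ b[3] ^ b[6] ^ b[7]
    c4 = b[4] ^ b[5] ^ b[6] ^ b[7]
    return c1, c2, c4

def hamming74_correct(word):
    """Correct at most one bit error; return (word, error position or 0)."""
    c1, c2, c4 = hamming74_syndrome(word)
    position = c1 + 2 * c2 + 4 * c4
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1     # correction is a bit flip
    return fixed, position

codeword = [0, 1, 1, 0, 0, 1, 1]     # encodes i1 i2 i3 i4 = 1 0 1 1
received = list(codeword)
received[4] ^= 1                     # inject an error at position 5
print(hamming74_syndrome(received))  # (1, 0, 1)
print(hamming74_correct(received))   # ([0, 1, 1, 0, 0, 1, 1], 5)
```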


Hamming Codes – by example (contd.)

  • The above method of construction can be generalized to construct an (n,k) Hamming code
  • Simple bound

k = number of information bits
r = number of check bits
n = k + r = total number of bits
n + 1 = number of single or fewer errors
Each error (including no error) must have a distinct syndrome
With r check bits, the max possible number of syndromes = 2^r
Hence: 2^r ≥ n + 1
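The bound 2^r ≥ n + 1 with n = k + r can be solved for the smallest r by simple search. This is a sketch; `min_check_bits` is a name chosen here for illustration.

```python
def min_check_bits(k):
    """Smallest r such that 2**r >= k + r + 1 (single error correction)."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r

for k in (4, 8, 11, 32, 64):
    r = min_check_bits(k)
    print(f"k={k:2d}  r={r}  n={k + r}")
# k= 4  r=3  n=7
# k= 8  r=4  n=12
# k=11  r=4  n=15
# k=32  r=6  n=38
# k=64  r=7  n=71
```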

Hamming Codes – by example (contd.)

Simple bound

When 2^r = n + 1, the corresponding Hamming code is a perfect code

  • Perfect Hamming codes can be constructed as follows:

p1   p2   i1  p4   i2  i3  i4  p8   i5  . . .
2^0  2^1  3   2^2  5   6   7   2^3  9   . . .

Parity equations can be written as before from the above matrix representation
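A sketch of this construction for general r, assuming (as in the (7,4) example) that column j of the matrix is the binary representation of position j, least significant bit in the top row, and that check bits sit at the power-of-two positions. Function names are hypothetical.

```python
def hamming_parity_check_matrix(r):
    """r x (2**r - 1) matrix whose column j (1-based) is j in binary, LSB in row 0."""
    n = 2 ** r - 1
    return [[(pos >> i) & 1 for pos in range(1, n + 1)] for i in range(r)]

def column_labels(r):
    """Label power-of-two positions as check bits p1, p2, p4, ...; the rest as i1, i2, ..."""
    labels, info = [], 0
    for pos in range(1, 2 ** r):
        if pos & (pos - 1) == 0:      # power of two
            labels.append(f"p{pos}")
        else:
            info += 1
            labels.append(f"i{info}")
    return labels

print(column_labels(3))  # ['p1', 'p2', 'i1', 'p4', 'i2', 'i3', 'i4']
for row in hamming_parity_check_matrix(3):
    print(row)
# [1, 0, 1, 0, 1, 0, 1]
# [0, 1, 1, 0, 0, 1, 1]
# [0, 0, 0, 1, 1, 1, 1]
```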

SEC-DED Codes – Algebraic method

  • Definitions

– (G, *) is an abelian (commutative) group if

  • There is a 0 in G (identity)
  • For every a in G, a^-1 is also in G (inverses)
  • For all a and b in G, a*b = b*a is also in G (closed, commutative)

– Examples

  • G = {0, 1}; * = ⊕ (Exclusive-OR)
  • (Z3, +3), the integers {0, 1, 2} under addition modulo 3, is a commutative group

SEC-DED Codes – Algebraic method (contd.)

  • Definitions (contd.)

– (F, +, .) is a field if

  • (F, +) is an abelian group with identity 0
  • (F - {0}, .) is an abelian group

– Examples

  • (F, ⊕, .) is a field
  • F = {0, 1}; ⊕ = Exclusive-OR; . = AND
  • The above field is called GF(2)
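For small sets, the group and field conditions can be checked by brute force. The sketch below is an assumption-laden illustration: `is_abelian_group` is a hypothetical helper, and it also tests associativity, a standard group axiom that the informal list on the slides leaves implicit.

```python
from itertools import product

def is_abelian_group(elements, op, identity):
    """Brute-force check of the abelian group axioms on a small finite set."""
    elements = list(elements)
    pairs = list(product(elements, repeat=2))
    closed = all(op(a, b) in elements for a, b in pairs)
    commutative = all(op(a, b) == op(b, a) for a, b in pairs)
    associative = all(op(op(a, b), c) == op(a, op(b, c))
                      for a, b, c in product(elements, repeat=3))
    has_identity = all(op(a, identity) == a for a in elements)
    has_inverses = all(any(op(a, b) == identity for b in elements)
                       for a in elements)
    return closed and commutative and associative and has_identity and has_inverses

def xor(a, b):
    return a ^ b

def and_(a, b):
    return a & b

def add_mod3(a, b):
    return (a + b) % 3

print(is_abelian_group([0, 1], xor, 0))          # True:  ({0,1}, XOR)
print(is_abelian_group([0, 1, 2], add_mod3, 0))  # True:  (Z3, + mod 3)
print(is_abelian_group([1], and_, 1))            # True:  (F - {0}, AND) for GF(2)
print(is_abelian_group([0, 1], and_, 1))         # False: 0 has no inverse under AND
```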

SEC-DED Codes – Algebraic method (contd.)

  • Definitions (contd.)

– Vector space V over a field F

  • (V, +) is an abelian group
  • For v in V and c in F, cv is in V
  • c(u + v) = cu + cv
  • (c+d)v = cv + dv
  • c(dv) = (cd)v

– S ⊆ V is a subspace if S is a vector space
– A linear combination of vectors is a vector

  • u = c1v1 + c2v2 + c3v3 + … + cnvn

SEC-DED Codes – Algebraic method (contd.)

  • Some results and more definitions

– Over GF(2), the collection of all n-bit vectors forms a vector space
– Let v1, v2, … , vk be n-bit vectors. Then all 2^k linear combinations of these k vectors form a subspace
– A set of k vectors v1, v2, … , vk is linearly independent if, whenever not all ci = 0 (i = 1, …, k), c1v1 + c2v2 + c3v3 + … + ckvk ≠ 0


SEC-DED Codes – Algebraic method (contd.)

  • Some results and more definitions (contd.)

– The largest number of linearly independent vectors in a vector space is the dimension of the space.

  • The dimension of the space containing all n-bit vectors is n
  • The dimension of the space containing all 2^k linear combinations of k vectors is no more than k.

– A binary (n,k) linear block code is a k-dimensional subspace of an n-dimensional vector space

SEC-DED Codes – Algebraic method (contd.)

  • A binary (n,k) linear block code can be described by a collection of k carefully chosen vectors. Each code word is a linear combination of these k vectors, thus forming a k-dimensional subspace.
  • These k vectors can be written as a k×n matrix G, called the generator matrix. A code word for a k-bit information word, say vector a, is obtained by aG
  • Example: For the (7,4) Hamming code described earlier

p1 p2 i1 p4 i2 i3 i4
 1  1  1  0  0  0  0
 1  0  0  1  1  0  0   = G
 0  1  0  1  0  1  0
 1  1  0  1  0  0  1

Note: a code word is a linear combination of rows of G
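Encoding with G amounts to XOR-ing the rows selected by the 1s of the information word. The sketch below uses the G of the (7,4) example and enumerates the whole code; the function name `encode` is an assumption. For a linear code the minimum distance equals the minimum weight of a nonzero code word, which is what the last line computes.

```python
from itertools import product

# Generator matrix of the (7,4) Hamming code, columns ordered p1 p2 i1 p4 i2 i3 i4.
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]

def encode(a, G):
    """Code word aG over GF(2): XOR of the rows of G selected by the 1s in a."""
    word = [0] * len(G[0])
    for bit, row in zip(a, G):
        if bit:
            word = [w ^ g for w, g in zip(word, row)]
    return word

code = [encode(a, G) for a in product((0, 1), repeat=4)]
min_weight = min(sum(c) for c in code if any(c))

print(encode((1, 0, 1, 1), G))  # [0, 1, 1, 0, 0, 1, 1]
print(len(code), min_weight)    # 16 3
```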

SEC-DED Codes – Algebraic method (contd.)

  • Two vectors v1 and v2 are orthogonal if v1 . v2 = 0
  • The code generated by G can also be represented by an r×n matrix H in which each row (an n-bit vector) of H is orthogonal to every row of G.
  • Hence GH^T = 0
  • dim G + dim H = n
  • Example: For the (7,4) Hamming code described earlier the H matrix is:

p1 p2 i1 p4 i2 i3 i4
 1  0  1  0  1  0  1
 0  1  1  0  0  1  1   = H
 0  0  0  1  1  1  1

  • Check that GH^T = 0
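The check can be done mechanically. A minimal sketch, with G and H copied from the (7,4) example and a hypothetical helper `mat_mul_gf2` for matrix multiplication modulo 2:

```python
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def transpose(M):
    return [list(col) for col in zip(*M)]

def mat_mul_gf2(A, B):
    """Matrix product over GF(2): AND for multiply, XOR (sum mod 2) for add."""
    Bt = transpose(B)
    return [[sum(a & b for a, b in zip(row, col)) % 2 for col in Bt] for row in A]

GHt = mat_mul_gf2(G, transpose(H))
print(GHt)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
```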

SEC-DED Codes – Algebraic method (contd.)

  • There are two ways to encode data words

– Use G (generator) matrix
– Use H (parity check) matrix

  • We will use H – being of lower dimensionality
  • Consider the following representation of H

H = [ Pr | Ir ], where Pr is an r×k matrix and Ir is the r×r identity matrix

  • Consider a code word a = (a1, a2, … , ak, p1, p2 … pr)
  • We can write parity check equations from the above H, i.e. from Ha^T = 0
  • Example: For the (7,4) Hamming code we can write the H matrix as:

a1 a2 a3 a4 p1 p2 p4
 1  1  0  1  1  0  0
 1  0  1  1  0  1  0   = H
 0  1  1  1  0  0  1

  • Can obtain previous parity equations from this H in a simple manner
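A sketch of encoding with a systematic H = [P | I]: each row of P lists the information bits that enter one check bit. Here P is written down from the parity equations given earlier (p1 = i1 ⊕ i2 ⊕ i4, and so on), assuming a1..a4 correspond to i1..i4; the function names are hypothetical.

```python
# Assumption: row j of P marks the information bits in the j-th parity equation.
P = [
    [1, 1, 0, 1],  # p1 = a1 ^ a2 ^ a4
    [1, 0, 1, 1],  # p2 = a1 ^ a3 ^ a4
    [0, 1, 1, 1],  # p4 = a2 ^ a3 ^ a4
]

def systematic_H(P):
    """Build H = [P | I] from the r x k matrix P."""
    r = len(P)
    return [row + [1 if i == j else 0 for i in range(r)] for j, row in enumerate(P)]

def encode_systematic(a, P):
    """Code word (a1..ak, p1..pr): information bits followed by check bits."""
    checks = [sum(p & x for p, x in zip(row, a)) % 2 for row in P]
    return list(a) + checks

def syndrome(H, x):
    """H x^T over GF(2)."""
    return [sum(h & b for h, b in zip(row, x)) % 2 for row in H]

H = systematic_H(P)
word = encode_systematic([1, 0, 1, 1], P)
print(word)               # [1, 0, 1, 1, 0, 1, 0]
print(syndrome(H, word))  # [0, 0, 0]
```

Because the information bits appear unchanged at the front of the code word, extracting the data after decoding is just a slice of the first k bits.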

SEC-DED Codes – Algebraic method (contd.)

  • Note that H is specified such that all information bits stay intact and together, and the check bits stay together and depend only on information bits
  • A code specified by an H of the above type is called a systematic code

– Data bits and check bits stay separate from each other
– It is easy to extract data bits from a code word

  • Statement: rearrangement of columns of H does not change the code. All it does is change the positions of the check bits and information bits
  • Question: when can we write an arbitrary H in systematic form?

SEC-DED Codes – Algebraic method (contd.)

  • Theorem: If H is an r×n matrix with rank(H) = r (rank r means H contains r linearly independent columns), then H can be transformed to a systematic form

– A row operation on H means a linear combination of parity check equations. Thus the solution of the equations does not change
– First rearrange the columns of H such that the last r columns are linearly independent
– Next find a matrix M that performs row operations on H such that M, when it multiplies the last r columns, gives the r×r identity matrix. Thus M is in fact the inverse of the matrix that consists of the last r columns of H
– Now the matrix MH will be in systematic form

  • Example in class
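The in-class example is not reproduced in these notes. As a stand-in, here is a minimal sketch of the procedure using Gaussian elimination over GF(2); the accumulated row operations play the role of M, rather than M being formed explicitly by matrix inversion. The function name and the returned column order are assumptions made for this illustration.

```python
def to_systematic(H):
    """Return (H_sys, order) with H_sys = [P | I] over GF(2).

    order[c] is the index of the original column that ends up at column c.
    """
    r, n = len(H), len(H[0])
    H = [row[:] for row in H]
    order = list(range(n))
    for i in range(r):
        c = n - r + i                       # column that must become unit vector e_i
        # Prefer the column already in place, else any column not yet used as a pivot.
        candidates = [c] + list(range(n - r)) + list(range(c + 1, n))
        for c2 in candidates:
            rows = [j for j in range(i, r) if H[j][c2]]
            if rows:
                break
        else:
            raise ValueError("rank(H) < r: no systematic form")
        if c2 != c:                         # column rearrangement
            for row in H:
                row[c], row[c2] = row[c2], row[c]
            order[c], order[c2] = order[c2], order[c]
        j = rows[0]
        H[i], H[j] = H[j], H[i]             # row swap
        for k in range(r):                  # clear column c in all other rows
            if k != i and H[k][c]:
                H[k] = [a ^ b for a, b in zip(H[k], H[i])]
    return H, order

H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H_sys, order = to_systematic(H)
for row in H_sys:
    print(row)
# [0, 1, 1, 1, 1, 0, 0]
# [1, 0, 1, 1, 0, 1, 0]
# [1, 1, 0, 1, 0, 0, 1]
```

For this H the last three columns are already linearly independent, so no columns move and only row operations are needed.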

SEC-DED Codes – Algebraic method (contd.)

  • Definition: The syndrome S of an n-bit word x is S = Hx^T. Note: S is an r-bit vector
  • Note also that in the above equation x^T provides a linear combination of columns of H
  • Example: consider a (6,3) systematic H and a 6-bit vector x
  • Theorem: for an (n,k) linear block code represented by H, the syndrome of every code word is 0

– Proof is more or less based on the way we have defined a block code and the H matrix

  • Definition: Error word, E, is a vector that represents where a codeword is erroneous

  • Example in class to define all these terms

SEC-DED Codes – Algebraic method (contd.)

  • Theorem: let C be a code word and E be an error word, i.e. C’ = C + E is the erroneous word (code word with an error in it). Let S’ be the syndrome of the word C’; then S’ = HE^T
  • Theorem: A linear block code represented by H is SEC if and only if the columns of H are distinct and non zero
  • Theorem: A linear block code represented by H is SEC-DED if:

– All columns of H are distinct and non zero
– The sum of any two columns of H is non zero and is not equal to a third column of H
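The first theorem follows in one line from linearity and from the fact that every code word has zero syndrome. A short derivation, writing h_j for the j-th column of H and e_j for the error word with a single 1 in position j (both symbols are introduced here for illustration):

```latex
S' = H C'^{T} = H (C + E)^{T} = H C^{T} + H E^{T} = 0 + H E^{T} = H E^{T}
% Single error in position j:  E = e_j  \Rightarrow  S' = H e_j^{T} = h_j
% Double error in positions j and l:  S' = h_j + h_l
```

So a single error produces the corresponding column of H as its syndrome, which is why distinct non zero columns give SEC. A double error produces the sum of two columns, which is why that sum must be neither zero nor equal to any column for DED.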

SEC-DED Codes – Algebraic method (contd.)

  • Consider an H matrix in which each column has an odd number of 1’s. The code generated by such an H matrix is called an odd weight column code
  • Example: consider r = 4. Let us consider an H, a 4×8 matrix:

1 0 0 0 0 1 1 1
0 1 0 0 1 0 1 1   = H
0 0 1 0 1 1 0 1
0 0 0 1 1 1 1 0

The first four columns are the wt = 1 columns and the last four are the wt = 3 columns. This is an (8,4) SEC-DED code

  • Theorem: Odd weight column code is a SEC-DED code
  • Theorem: Hamming code with overall parity is a SEC-DED code
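The column conditions of the SEC and SEC-DED theorems are easy to test by brute force. The sketch below applies them to three matrices: the (7,4) Hamming H, the 4×8 odd weight column H from the example, and the Hamming H extended with an overall parity row. The function names are hypothetical, and placing the overall parity bit in an added last column is an assumption about the layout.

```python
from itertools import combinations

def columns(H):
    return [tuple(col) for col in zip(*H)]

def is_sec(H):
    """All columns non zero and pairwise distinct."""
    cols = columns(H)
    return all(any(c) for c in cols) and len(set(cols)) == len(cols)

def is_sec_ded(H):
    """SEC, and no sum of two columns is zero or equal to a column of H."""
    if not is_sec(H):
        return False
    cols = columns(H)
    colset = set(cols)
    for a, b in combinations(cols, 2):
        s = tuple(x ^ y for x, y in zip(a, b))
        if not any(s) or s in colset:
            return False
    return True

H_hamming = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H_odd = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]
# Hamming code with overall parity: extra column for the parity bit, extra all-ones row.
H_ext = [row + [0] for row in H_hamming] + [[1] * 8]

print(is_sec(H_hamming), is_sec_ded(H_hamming))  # True False
print(is_sec(H_odd), is_sec_ded(H_odd))          # True True
print(is_sec(H_ext), is_sec_ded(H_ext))          # True True
```

The plain Hamming code fails the second condition because, for example, columns 1 and 2 sum to column 3, so a double error in those positions is indistinguishable from a single error in position 3.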

SEC-DED Codes – Algebraic method (contd.)

  • Shortened codes

– Sometimes we are interested in codes that do not exactly satisfy the bound derived for perfect Hamming codes. For example, consider the case when k=8. Clearly we will need r=4. But we do not want to have a (15,11) code. What we want is a (12,8) code. The following result comes in handy to design such codes and still have error correction capability
– Result: Deleting columns of H does not alter the error correction capability of the corresponding code

  • Proof: the conditions stated in the theorems (for example, columns remaining odd weight columns, or no two columns being identical) do not change by deleting columns of H.

  • What columns to delete? See next hardware issue.
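A sketch of shortening the (15,11) Hamming code to a (12,8) code. Columns are represented as the integers 1 to 15 (column j is j in binary). The rule used here, keep the check columns and drop the heaviest information columns, is one plausible choice motivated by the hardware goal of having few 1s in H; it is an assumption of this sketch, not a rule stated in the lecture.

```python
def weight(c):
    """Number of 1s in the binary representation of column c."""
    return bin(c).count("1")

def shorten(r, n_target):
    """Pick n_target columns of the r-check-bit Hamming H, dropping the heaviest ones."""
    cols = list(range(1, 2 ** r))
    check = [c for c in cols if weight(c) == 1]            # must be kept
    info = sorted((c for c in cols if weight(c) > 1),
                  key=lambda c: (weight(c), c))            # lightest first
    return sorted(check + info[:n_target - r])

kept = shorten(4, 12)
deleted = sorted(set(range(1, 16)) - set(kept))
print(kept)     # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(deleted)  # [13, 14, 15]
# The SEC conditions survive: the kept columns are still non zero and distinct.
print(all(c != 0 for c in kept) and len(set(kept)) == len(kept))  # True
```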

SEC-DED Codes –Hardware

  • Encoding hardware

[Figure: encoder block diagram. The k information bits feed an XOR tree that produces the r check bits; the k information bits are also passed through unchanged.]

SEC-DED Codes –Hardware (contd.)

  • Decoding hardware – Algorithm

– Compute syndrome S
– If S = 0 then no error
– If S ≠ 0 { decode S
  – If S is in range (decoded S ≤ n) then correct the S-th bit
  – Else there is an uncorrectable error
}

  • Note: it is easy to determine if S is 0
  • Decoding S is also straight forward
  • Correction implies a bit flip (EOR operation)
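The algorithm can be sketched in software for a (possibly shortened) Hamming code, assuming column j of H is the binary representation of position j. Under that assumption the syndrome, read as a number, is simply the XOR of the positions that hold a 1. The function names and the status strings are hypothetical.

```python
def syndrome(word):
    """H x^T as an integer: XOR of the 1-based positions that hold a 1."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s

def decode(word):
    """Apply the decoding algorithm; return (word, status)."""
    n = len(word)
    s = syndrome(word)
    if s == 0:
        return list(word), "no error"
    if s <= n:                      # S is in range: flip the S-th bit
        fixed = list(word)
        fixed[s - 1] ^= 1
        return fixed, f"corrected bit {s}"
    return list(word), "uncorrectable error"

# A 12-bit code word of the shortened (12,8) code: 1s at positions 3, 5, 6.
codeword = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]

single = list(codeword)
single[8] ^= 1                      # error at position 9
double = list(codeword)
double[5] ^= 1                      # errors at positions 6 and 9
double[8] ^= 1

print(decode(codeword)[1])  # no error
print(decode(single)[1])    # corrected bit 9
print(decode(double)[1])    # uncorrectable error
```

The out-of-range check catches only those multiple errors whose syndrome exceeds n. Since this is a SEC code, other double errors can still be miscorrected.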

SEC-DED Codes –Hardware (contd.)

  • Decoding hardware – Implementation

[Figure: decoder block diagram. Blocks shown: EOR tree (inputs: k information bits and r check bits), r-bit syndrome, NOR, decoder, AND, error corrector (n EORs), corrected word (n bits).]

SEC-DED Codes –Hardware (contd.)

  • Hardware simplification

– Reduce number of EORs

  • Have as few 1s in the matrix as possible

– Reduce delay – depth of EOR tree

  • Have as few 1s in each row of H as possible
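A rough cost model makes these two goals concrete. The sketch assumes each syndrome bit is computed by its own balanced tree of 2-input EOR gates with no sharing between rows; under that assumption a row with w ones costs w - 1 gates and has depth ceil(log2 w). The function name `eor_cost` is hypothetical.

```python
def eor_cost(H):
    """Return (total 2-input EOR gates, worst-case tree depth) for syndrome generation."""
    weights = [sum(row) for row in H]
    gates = sum(w - 1 for w in weights if w > 0)
    # (w - 1).bit_length() equals ceil(log2(w)) for w >= 1, without floating point.
    depth = max((w - 1).bit_length() for w in weights if w > 0)
    return gates, depth

H_hamming = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H_odd = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]
print(eor_cost(H_hamming))  # (9, 2)
print(eor_cost(H_odd))      # (12, 2)
```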

SEC-DED-SBD Codes

  • Motivation

– Many memories are organized as byte oriented
– Failures manifest themselves as follows

  • Random failure – bit error
  • Chip failure – byte error

– The objective is to detect such byte errors while detecting and correcting random errors. Hence the error model

  • Single random error
  • Multiple errors limited within a byte

SEC-DED-SBD Codes (contd.)

  • Theorem (Reddy): Let E1 and E2 be two sets of error patterns with E1 ∩ E2 = ∅. A linear block code described by H can correct all errors in E1 and detect all errors in E2 if and only if

a) For every e in E1 ∪ E2, He^T ≠ 0
b) For every pair of distinct ei, ej in E1, Hei^T ≠ Hej^T, and
c) For every ei in E2 there is no ej in E1 such that Hei^T = Hej^T

SEC-DED-SBD Codes (contd.)

  • To demonstrate the use of the theorem, let us look at an example H matrix and its capabilities for a small byte (nibble) size
  • b = number of bits in each memory card
  • n = total number of bits in a code word
  • r = number of check bits
  • n = b(2^(r-b+1) – 1)
  • For b = 4 and r = 5 we have n = 12. Thus we will construct a (12,7) code which will be able to correct any single error and detect errors confined to 4-bit nibbles
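Written out for the values used in this example (with k = n - r information bits, as defined earlier):

```latex
n = b\left(2^{\,r-b+1} - 1\right) = 4\left(2^{\,5-4+1} - 1\right) = 4\,(2^{2} - 1) = 4 \cdot 3 = 12,
\qquad k = n - r = 12 - 5 = 7
```

The 12 bits therefore form three 4-bit nibbles.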

SEC-DED-SBD Codes (contd.)

  • Many parts of the code are shown as blocks in the following figure

[Figure: block structure of the H matrix, with one part that detects multiple errors within a byte and a correction part.]


SEC-DED-SBD Codes (contd.)

  • Now let us look at the complete matrix

0 0 0 0  1 1 1 1  1 1 1 1
1 1 1 1  0 0 0 0  1 1 1 1
1 0 0 0  1 0 0 0  1 0 0 0
0 1 0 1  0 1 0 1  0 1 0 1
0 0 1 1  0 0 1 1  0 0 1 1

SEC-DED-SBD Codes (contd.)

  • The capability can be proven as follows
  • E1: single errors; E2: errors limited to 4-bit nibbles
  • All columns are non-zero and any combinations of columns within a 4-bit nibble are also non-zero
  • All columns are distinct – providing single error correction capability
  • The last 3 rows guarantee that no combination of errors limited to a nibble will have a syndrome identical to a single error syndrome
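The three conditions of Reddy's theorem can also be checked exhaustively for this matrix. In the sketch below, E1 is taken as all 12 single-bit errors and E2 as all patterns of 2, 3 or 4 errors confined to one nibble, so that E1 and E2 are disjoint. The variable and function names are assumptions.

```python
from itertools import combinations

H = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
]
B, N = 4, 12   # nibble size and code word length

def syndrome(error_positions):
    """H e^T for the error word with 1s at the given 0-based positions."""
    s = [0] * len(H)
    for p in error_positions:
        s = [a ^ row[p] for a, row in zip(s, H)]
    return tuple(s)

E1 = [(p,) for p in range(N)]
E2 = [tuple(start + i for i in subset)
      for start in range(0, N, B)
      for size in range(2, B + 1)
      for subset in combinations(range(B), size)]

s1 = [syndrome(e) for e in E1]
s2 = [syndrome(e) for e in E2]

cond_a = all(any(s) for s in s1 + s2)   # every syndrome is non zero
cond_b = len(set(s1)) == len(s1)        # single-error syndromes are distinct
cond_c = not (set(s1) & set(s2))        # no E2 syndrome equals an E1 syndrome

print(len(E1), len(E2))         # 12 33
print(cond_a, cond_b, cond_c)   # True True True
```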

SEC-DED-SBD Codes (contd.)

  • Comments

– The code can be converted to a systematic code
– Distance of the code can be increased by 1 to make it a DED code
– This code can also be shortened

Summary

  • Why ECC in Fault tolerance
  • Hamming code – by example
  • Algebra and Algebraic coding

– Codes
– Hardware

  • SEC-SBD code