Store, Forget & Check: Using Algebraic Signatures to Check Remotely Administered Storage

slide-1
SLIDE 1

Store, Forget & Check: Using Algebraic Signatures to Check Remotely Administered Storage

Ethan L. Miller & Thomas J. E. Schwarz

Storage Systems Research Center, University of California, Santa Cruz

slide-2
SLIDE 2

What’s the problem?

  • Systems store data on remote nodes
  • Remote nodes may not be trustworthy
  • Data owner must check to ensure that data is really stored
  • Two current approaches:
  • Read data from multiple sites and check for consistency
  • Generate checksum remotely and compare to checksum of local data
  • We developed an efficient algorithm that does not require keeping a local copy of the data

slide-3
SLIDE 3

Internet storage: backup

  • Participants in the scheme offer limited storage on their machine in exchange for storing their own data
  • Data protected using parity or redundancy
  • Extra blocks calculated using m/n redundancy codes
  • Generate n blocks
  • Require any m of the blocks to rebuild the data
  • Many known mechanisms for m/n codes
  • Linear interpolation
  • XOR and Galois field-based
  • Participants need to be able to verify that other nodes are doing their part...
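
The simplest XOR-based m/n code can be sketched as follows. This is a minimal illustration, assuming m = 3 data blocks and a single XOR parity block (so n = 4); the block contents and the helper name `xor_blocks` are made up for the example and do not come from the slides.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# m = 3 data blocks of equal length (illustrative contents).
data = [b"block-one!", b"block-two!", b"block-3!!!"]

# One parity block gives n = 4 stored blocks in total.
parity = xor_blocks(data)

# Suppose the node holding the second block disappears:
# any m = 3 surviving blocks are enough to rebuild it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Codes that tolerate more than one lost block need Galois field arithmetic rather than plain XOR, which is where the later slides pick up.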

slide-4
SLIDE 4

Storage Service Providers

  • Storage utility provides remotely managed storage
  • Client sends data to the SSP
  • Client retrieves data as needed
  • Trust issue: how can client tell if SSP is doing its job?
  • Read data, check (public key-based) signature
  • Read data, decrypt, check secure hash and object ID
  • SafeStore does something like this
  • Other approaches that don’t use network bandwidth?


slide-5
SLIDE 5

Peer-to-peer file systems

  • Farsite: uses free space on workstations within an organization
  • Freehaven: anonymity of storer
  • OceanStore
  • “Billions of users”
  • Byzantine fault tolerance, k-availability through erasure codes
  • PAST
  • Users can store files up to their quota
  • Provides k-availability through replication
  • CFS, Intermemory, Ivy, Starfish, …

slide-6
SLIDE 6

Common challenges

  • Storage nodes cannot be trusted
  • Storage nodes might lack high uplink bandwidth
  • Storage nodes might have low availability
  • Free Rider problem
  • Node pretends to store data
  • In reality, uses replicas (or another mechanism that protects against unavailability) to fetch the requested file from elsewhere
  • Gains the benefits of participation without providing storage

slide-7
SLIDE 7

Terribly naïve algorithm

  • Maintain local copy of data
  • Periodically request blocks of data and compare to the local copy
  • Problems
  • Very bandwidth-intensive
  • Can’t check much data
  • Need to keep the original!
slide-10
SLIDE 10

Verification: existing algorithm

  • Periodically, verify random blocks
  • Compute function across the blocks (m/n coding)
  • Alternative: verify keyed hash stored with the block
  • Problems:
  • Need to transfer entire block
  • Taxes network with diagnostic data
  • Peers often have asymmetric Internet connections
  • Leaks information heavily
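
The keyed-hash alternative above can be sketched as follows, to show why it needs the entire block on the wire. This is an illustration only: the slides do not name a hash function, so HMAC-SHA256 from the Python standard library is an assumption, as are the helper names `tag_block` and `verify_block` and the 4 KiB block size.

```python
import hashlib
import hmac
import os

def tag_block(key: bytes, block_id: int, block: bytes) -> bytes:
    """Keyed hash over the block ID and the block contents."""
    message = block_id.to_bytes(8, "big") + block
    return hmac.new(key, message, hashlib.sha256).digest()

def verify_block(key: bytes, block_id: int, block: bytes, tag: bytes) -> bool:
    """Recompute the keyed hash and compare in constant time."""
    return hmac.compare_digest(tag_block(key, block_id, block), tag)

key = os.urandom(32)             # known only to the data owner
block = os.urandom(4096)         # hypothetical 4 KiB block
stored_tag = tag_block(key, 17, block)   # stored remotely next to the block

# To audit, the owner must download the whole block plus its tag.
bytes_transferred = len(block) + len(stored_tag)
print(verify_block(key, 17, block, stored_tag), bytes_transferred)
```

Every audit moves the full block across the storing peer's uplink, which is exactly the bandwidth and leakage problem the bullets describe.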
slide-14
SLIDE 14

Verification using algebraic signatures

  • Solution: use checksums?
  • Cryptographic checksums (like SHA-1) won’t work for randomly selected ranges
  • Requires original data for comparison
  • Our scheme
  • Uses small challenges and responses
  • Allows unpredictable tests
  • Free rider gains no storage benefit from storing the answers to all possible challenges instead of the data
  • Verifies that all remote chunks are consistent with each other
  • Requires that parity is calculated with an XOR code, a linear m/n code, or a convolutional code
  • Examples: X-code, EvenOdd, row-diagonal parity, linear codes over a Galois field

slide-15
SLIDE 15

What is a Galois field?

  • Simple answer:
  • Calculations on a set of symbols
  • A field called GF(2^n) uses n-bit symbols
  • Two kinds of operations
  • Addition (done by XOR)
  • Multiplication (more complex, done by tables)
  • Complex answer:
  • Galois field arithmetic is done on polynomials, with each symbol holding the coefficients of one polynomial
  • Often, coefficients are base-2 (each coefficient is a single bit)
  • The Galois field of polynomials with degree less than n and base-2 coefficients is called GF(2^n)
  • This answer explains how the addition and multiplication tables are generated
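
The two operations can be sketched for GF(2^8) as follows. The slides do not say which field polynomial the authors used, so the polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1, for which the element 2 is primitive) is an assumption here, and the table names `EXP` and `LOG` are illustrative.

```python
POLY = 0x11D                 # assumed field polynomial for GF(2^8)

# Build antilog (EXP) and log (LOG) tables from powers of the primitive element 2.
EXP = [0] * 510
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1                  # multiply by 2
    if x & 0x100:            # reduce when the degree reaches 8
        x ^= POLY
for i in range(255, 510):    # duplicate so LOG[a] + LOG[b] never needs a modulo
    EXP[i] = EXP[i - 255]

def gf_add(a, b):
    """Addition (and subtraction) is bitwise XOR."""
    return a ^ b

def gf_mul(a, b):
    """Multiplication by table lookup: add the logs, take the antilog."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def gf_inv(a):
    """Multiplicative inverse of a non-zero element."""
    return EXP[255 - LOG[a]]

print(gf_add(0x57, 0x83), gf_mul(2, 0x80), gf_mul(0x57, gf_inv(0x57)))
```

The table construction is the "complex answer" at work: each step multiplies a polynomial by x and reduces it modulo the field polynomial.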

slide-16
SLIDE 16

What is an algebraic signature?

  • Digital hash with algebraic properties
  • Important properties:
  • Small changes in data result in complete change of signature
  • Signature of parity is parity of signatures
  • For data blocks D1, D2, D3, …, Dm and parity blocks P1, P2, P3, …, Pk: (sig(D1), sig(D2), sig(D3), …, sig(Dm), sig(P1), sig(P2), sig(P3), …, sig(Pk)) is a codeword!
slide-17
SLIDE 17

Algebraic signatures

  • Defined over same Galois field as the linear m/n code
  • Use “primitive” element α
  • All non-zero elements are powers of α
  • Consists of n coordinates
  • Additional properties if αi = α^i
  • Coordinate signature defined by [equation not preserved in this transcript]

slide-18
SLIDE 18

Algebraic signatures

  • Algebraic properties
  • Assume that X and Y are large data objects:
  • sig(X⊕Y) = sig(X) ⊕ sig(Y)
  • sig(β⋅X) = β⋅sig(X)
  • Multiplication is in the Galois field of the signature calculation
  • Signatures and parity formation commute
  • Signatures can be updated from the old signature and the signature of the delta (XOR) between old and new data
  • Signature calculation is fast!
  • Hundreds of megabytes per second on a modern CPU
  • Speed limited by disk bandwidth
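
These properties can be checked numerically with a short sketch. The coordinate formula did not survive in this transcript, so the definition used here, sig_α(D) = d_0 ⊕ d_1·α ⊕ d_2·α^2 ⊕ …, is an assumption based on the usual definition in the algebraic-signature literature. The field GF(2^8) with polynomial 0x11D, the primitive element α = 2, and the names `sig` and `sig_n` are also assumptions for illustration.

```python
import os

def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8): shift-and-XOR, reducing by the field polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def sig(data, alpha=2):
    """Assumed coordinate signature: XOR-sum of data[i] * alpha^i."""
    s, power = 0, 1
    for d in data:
        s ^= gf_mul(d, power)
        power = gf_mul(power, alpha)
    return s

def sig_n(data, n, alpha=2):
    """n-coordinate signature using alpha^1, alpha^2, ..., alpha^n."""
    coords, a = [], 1
    for _ in range(n):
        a = gf_mul(a, alpha)
        coords.append(sig(data, a))
    return tuple(coords)

X, Y = os.urandom(1024), os.urandom(1024)
delta = bytes(x ^ y for x, y in zip(X, Y))
beta = 0x53
scaled = bytes(gf_mul(beta, x) for x in X)

linear_ok = sig(delta) == sig(X) ^ sig(Y)          # sig(X xor Y) = sig(X) xor sig(Y)
scale_ok = sig(scaled) == gf_mul(beta, sig(X))     # sig(beta * X) = beta * sig(X)
update_ok = sig(Y) == sig(X) ^ sig(delta)          # update old signature from the delta
print(linear_ok, scale_ok, update_ok, sig_n(X, 4))
```

All three checks hold for any inputs because the signature is a linear map over the field; the random data only makes the demonstration less contrived.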

slide-19
SLIDE 19

Our algorithm

  • Store data across distributed system
  • Challenge sites to prove that they hold the data
  • Example challenge: calculate signature of 32 byte ranges at 4+i×71, i = 5,…,20
  • Sites respond with the signatures of requested data
  • Sites reveal tiny amount of information: size of signature
  • Challenger verifies that the signatures are consistent
  • Example with data blocks D1, D2, D3 and parity block P, which return sig1, sig2, sig3, sigP: sig1 ⊕ sig2 ⊕ sig3 ≟ sigP
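
The challenge and the consistency check can be sketched as follows, using the example challenge from the slide (32-byte ranges at offsets 4+i×71 for i = 5,…,20) and a single XOR parity block. The signature definition, the field GF(2^8) with polynomial 0x11D, the 2048-byte block size, and the helper names are assumptions for illustration. A one-byte signature is used only to keep the sketch short; a real deployment would presumably use a wider signature.

```python
import os

def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8): shift-and-XOR, reducing by the field polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def sig(data, alpha=2):
    """Assumed coordinate signature: XOR-sum of data[i] * alpha^i."""
    s, power = 0, 1
    for d in data:
        s ^= gf_mul(d, power)
        power = gf_mul(power, alpha)
    return s

def challenged_bytes(block, size=32, base=4, stride=71, first=5, last=20):
    """Concatenate the byte ranges named by the challenge."""
    return b"".join(block[base + i * stride: base + i * stride + size]
                    for i in range(first, last + 1))

def respond(block):
    """What a storage site sends back: one small signature."""
    return sig(challenged_bytes(block))

# Three data blocks and their XOR parity, each held by a different site.
D = [os.urandom(2048) for _ in range(3)]
P = bytes(a ^ b ^ c for a, b, c in zip(*D))

sig1, sig2, sig3 = (respond(d) for d in D)
sig_p = respond(P)
consistent = (sig1 ^ sig2 ^ sig3) == sig_p

# A site that altered a byte inside a challenged range is caught.
altered = bytearray(D[1])
altered[4 + 5 * 71] ^= 0xFF
cheat_detected = (sig1 ^ respond(bytes(altered)) ^ sig3) != sig_p
print(consistent, cheat_detected)
```

The challenger never holds the data: it only compares four one-byte responses, and each response reveals no more than its own size.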

slide-26
SLIDE 26

Collusion protection

  • Need for collusion protection
  • All data and parity storing sites could collude to undetectably change the contents of data (or just throw it away)
  • Colluding sites could still return signatures that are internally consistent
  • Modify scheme to prevent collusion by:
  • Using random m/n linear codes to generate parity blocks
  • Blinding data or parity by XORing with pseudo-random stream
  • Stream cipher seeded with block ID and secret known only to data owner


slide-31
SLIDE 31

Blinding with a random stream

[Figure labels: Data, Pseudo-random stream generator, Seed1, Seed2, Out]

  • XOR data stream with a known pseudo-random stream
  • Algebraic signature of blinded data is the XOR of the signature of the original data and of the blinding stream
  • Data owner need only keep the seed of the pseudo-random stream
  • Reconstruct signature by recalculating stream
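
Blinding and the owner-side unblinding of a signature can be sketched as follows. The slides do not name a stream generator, so SHA-256 in counter mode stands in for it here purely as an illustration. The `keystream` helper, the secret, the block ID, the signature definition, and the field polynomial 0x11D are all assumptions.

```python
import hashlib
import os

def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8): shift-and-XOR, reducing by the field polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def sig(data, alpha=2):
    """Assumed coordinate signature: XOR-sum of data[i] * alpha^i."""
    s, power = 0, 1
    for d in data:
        s ^= gf_mul(d, power)
        power = gf_mul(power, alpha)
    return s

def keystream(secret: bytes, block_id: int, length: int) -> bytes:
    """Stand-in pseudo-random stream seeded with a secret and the block ID."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        seed = secret + block_id.to_bytes(8, "big") + counter.to_bytes(8, "big")
        out += hashlib.sha256(seed).digest()
        counter += 1
    return bytes(out[:length])

secret = b"known only to the data owner"      # hypothetical secret
parity = os.urandom(512)                      # block to be blinded
stream = keystream(secret, 7, len(parity))
blinded = bytes(p ^ s for p, s in zip(parity, stream))   # what the site stores

# Later: the site returns sig(blinded); the owner regenerates the stream
# from the seed alone and strips its signature off again.
site_response = sig(blinded)
unblinded = site_response ^ sig(keystream(secret, 7, len(parity)))
print(unblinded == sig(parity))
```

Because the signature is linear, the owner keeps only the secret and the block ID, never the stream or the data.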
slide-34
SLIDE 34

Blinding with random stream

  • Prevent colluding sites from discovering parity calculation scheme by XORing blocks with pseudo-random data
  • Several possibilities:
  • Blind only parity
  • Blind only data after calculating parity
  • Calculate parity, then blind data and parity
  • Blinding must be done after parity calculation
  • Necessary to ensure that storage servers can’t solve for the generator matrix
  • Storage servers lack both the pseudo-random stream and the generator matrix
  • Blinding doesn’t prevent data recovery!
  • Example: blinding just the parity means data can be read

slide-35
SLIDE 35

Generating random m/n codes

  • Linear m/n codes are defined over a Galois field with 2^f elements
  • Galois fields have addition, subtraction, multiplication, division, 0, 1
  • Same rules in Galois fields as for real, rational, complex, … numbers
  • Code defined by a generator matrix G with m rows and n columns
  • G has special form (I_m | P)
  • I_m is the m×m identity matrix
  • Every m×m submatrix of G is invertible

slide-36
SLIDE 36

Generating random m/n codes

[Figure: Vandermonde matrix and generator matrix]

  • Generation of generator matrix with n = 2^f:
  • Start with all 2^f GF elements in a given order: a1, a2, … an
  • Order can be changed for a different code
  • Form Vandermonde matrix
  • Use Gaussian elimination to transform it into the desired form (I_m | P)

slide-37
SLIDE 37

Generating random m/n codes

  • Given data bytes d1, d2, … dm, calculate parity bytes by: (d1, d2, … dm)⋅G = (d1, d2, … dm, p1, p2, … , pk)
  • Generate random code by starting with a random permutation of the GF elements
  • Pick a random code using a seed the data owner keeps
  • If all sites collude:
  • G can be reconstructed from known encoding
  • This can be prevented if the sites don’t know the correct values of dx or px for the calculation
  • Blinding ensures they don’t!
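
The construction on the last three slides can be sketched as follows: permute the field elements with a seeded generator, form a Vandermonde matrix, and reduce it to (I_m | P) by Gaussian elimination. This is a sketch under assumptions: the field is GF(2^8) with polynomial 0x11D, Python's `random.Random` stands in for whatever seeded generator the authors used, and only the first n elements of the permutation are used so that n can be smaller than 2^f. All function names are illustrative.

```python
import random

def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8): shift-and-XOR, reducing by the field polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf_pow(a, e):
    """a raised to the e-th power, with a^0 = 1."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Inverse of a non-zero element: a^254, since a^255 = 1."""
    return gf_pow(a, 254)

def random_generator(m, n, seed):
    """Seeded random m x n generator matrix in the form (I_m | P)."""
    elements = list(range(256))
    random.Random(seed).shuffle(elements)       # random order of the GF elements
    points = elements[:n]
    G = [[gf_pow(a, row) for a in points] for row in range(m)]   # Vandermonde
    for col in range(m):                        # Gauss-Jordan elimination
        pivot = next(r for r in range(col, m) if G[r][col])
        G[col], G[pivot] = G[pivot], G[col]
        inv = gf_inv(G[col][col])
        G[col] = [gf_mul(inv, v) for v in G[col]]
        for r in range(m):
            if r != col and G[r][col]:
                factor = G[r][col]
                G[r] = [v ^ gf_mul(factor, p) for v, p in zip(G[r], G[col])]
    return G

def encode(data, G):
    """(d1, ..., dm) . G = (d1, ..., dm, p1, ..., pk)."""
    codeword = []
    for j in range(len(G[0])):
        symbol = 0
        for d, row in zip(data, G):
            symbol ^= gf_mul(d, row[j])
        codeword.append(symbol)
    return codeword

m, n = 3, 5
G = random_generator(m, n, seed=2006)           # the owner keeps only the seed
data = [0x12, 0x34, 0x56]
codeword = encode(data, G)
print(codeword[:m] == data, codeword[m:])
```

Row operations do not change which sets of columns are linearly independent, so the reduced matrix keeps the Vandermonde property that every m×m submatrix is invertible.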

slide-38
SLIDE 38

Securing a single site

  • Use our scheme to secure data stored on a single site
  • Can place all data and one or two parity chunks on a single site
  • Single site cannot undetectably alter the contents
  • Storage overhead can be made arbitrarily small
  • Anyone knowing the erasure coding and the blinding scheme can now check that the data is stored accurately
  • May use any randomly-selected slice of the data
  • Storage site can’t get away with keeping only hashes of the data blocks


slide-40
SLIDE 40

Example

  • Create a backup peer-to-peer system
  • Business model
  • Clients
  • Register with central site
  • Clients pay annual registration fee
  • Clients promise
  • to store 10GB + ~2% on their computer
  • to have their machine connected to the internet at least 20 hours every 24 hours
  • Enterprise promises
  • To establish an addressing scheme that allows clients to store data remotely
  • To verify that clients fulfill their promises

slide-41
SLIDE 41

Example

  • Use of our technique:
  • Enterprise can verify that the data is stored correctly
  • Client might encrypt data to prevent anyone else from rebuilding it
  • May use any desired encryption algorithm, as long as the parity is calculated after encryption
  • Blinding still has to be done...
  • Signatures don’t leak much data anyway...
  • Alternative
  • Client that stores data also prepares a number of signature challenges
  • Client gives those to the enterprise
  • Enterprise uses the challenges, but might run out of them
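
One possible reading of the alternative is sketched below: before forgetting its data, the owning client precomputes a few challenges together with the expected signatures, and the enterprise spends one per audit. The slides give no detail on the challenge format, so the (offset, length, expected signature) triples, the helper names, the signature definition, and the field polynomial 0x11D are all assumptions.

```python
import os
import random

def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8): shift-and-XOR, reducing by the field polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def sig(data, alpha=2):
    """Assumed coordinate signature: XOR-sum of data[i] * alpha^i."""
    s, power = 0, 1
    for d in data:
        s ^= gf_mul(d, power)
        power = gf_mul(power, alpha)
    return s

def prepare_challenges(data, count, length=32, seed=None):
    """Client side: precompute (offset, length, expected signature) triples."""
    rng = random.Random(seed)
    challenges = []
    for _ in range(count):
        offset = rng.randrange(len(data) - length + 1)
        challenges.append((offset, length, sig(data[offset:offset + length])))
    return challenges

def audit(challenges, fetch_signature):
    """Enterprise side: spend one challenge; None means the supply ran out."""
    if not challenges:
        return None
    offset, length, expected = challenges.pop()
    return fetch_signature(offset, length) == expected

stored = os.urandom(4096)                       # data held by the storing peer
challenges = prepare_challenges(stored, count=3, seed=1)

def honest_peer(offset, length):
    return sig(stored[offset:offset + length])

results = [audit(challenges, honest_peer) for _ in range(4)]
print(results)
```

The fourth audit returns `None`, which is the "might run out of them" caveat: each precomputed challenge can be used only once before the storing peer could simply remember the answer.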

slide-42
SLIDE 42

Future work

  • Implementing a peer-to-peer backup system using this approach
  • Includes techniques from prior peer-to-peer backup systems
  • Adds low-overhead verification
  • Using this technique in POTSHARDS to verify that remote servers are maintaining data
  • Limited leakage is critical for this application
  • Use this technique for storage service providers?

slide-43
SLIDE 43

Conclusions

  • We developed a scheme that can verify data stored on storage sites outside of our administrative control
  • Small challenges and responses
  • Good for network load
  • Limits information leakage to negligible sizes
  • Sufficient variety to force potential free-loaders to store the data, not potential answers
  • Secure against collusion
  • Basic scheme for storage schemes that use parity or m/n coding for high availability
  • Extension to storing data on a single site
  • Incurs arbitrarily small overhead from storing additional parity data

slide-44
SLIDE 44

Questions?

More information on the Web at http://www.ssrc.ucsc.edu/proj/archive.html

Thanks to SSRC faculty, students, and sponsors!