Store, Forget & Check: Using Algebraic Signatures to Check Remotely Administered Storage
Ethan L. Miller & Thomas J. E. Schwarz
Storage Systems Research Center
University of California, Santa Cruz
What’s the problem?
- Systems store data on remote nodes
- Remote nodes may not be trustworthy
- Data owner must check to ensure that data is really stored
- Two current approaches:
- Read data from multiple sites and check for consistency
- Generate checksum remotely and compare to checksum of local data
- We developed an efficient algorithm that does not require keeping a local copy of the data
Internet storage: backup
- Participants in the scheme offer limited storage on their machine in exchange for storing their own data
- Data protected using parity or redundancy
- Extra blocks calculated using m/n redundancy codes
- Generate n blocks
- Require any m of the blocks to rebuild the data
- Many known mechanisms for m/n codes
- Linear interpolation
- XOR and Galois field-based
- Participants need to be able to verify that other nodes are doing their part...
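The m/n idea above can be illustrated with the simplest XOR-based case, a single parity block (n = m + 1). This is a minimal sketch, not the scheme from the talk; the function names `xor_blocks`, `encode`, and `rebuild` are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def encode(data_blocks):
    """Return the m data blocks plus one XOR parity block (n = m + 1)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(blocks, missing):
    """Recover the block at index `missing` from the other m blocks."""
    return xor_blocks([b for i, b in enumerate(blocks) if i != missing])

# Three data blocks (m = 3) become four stored blocks (n = 4).
stored = encode([b"abcd", b"efgh", b"ijkl"])
```

Because the XOR of all n blocks is zero, any one lost block equals the XOR of the remaining m.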
Storage Service Providers
- Storage utility provides remotely managed storage
- Client sends data to the SSP
- Client retrieves data as needed
- Trust issue: how can client tell if SSP is doing its job?
- Read data, check (public key-based) signature
- Read data, decrypt, check secure hash and object ID
- SafeStore does something like this
- Other approaches that don’t use network bandwidth?
Peer-to-peer file systems
- Farsite: uses free space on workstations within an organization
- Freehaven: anonymity of storer
- OceanStore
- “Billions of users”
- Byzantine fault tolerance, k-availability through erasure codes
- PAST
- Users can store files up to their quota
- Provides k-availability through replication
- CFS, Intermemory, Ivy, Starfish, …
Common challenges
- Storage nodes cannot be trusted
- Storage nodes might lack high uplink bandwidth
- Storage nodes might have low availability
- Free Rider problem
- Node pretends to store data
- In reality, uses replicas (or whatever mechanism protects against unavailability) to fetch the requested file from elsewhere
- Gains the benefits of participation without providing storage
Terribly naïve algorithm
- Maintain local copy of data
- Periodically request blocks of data and compare to the local copy
- Problems
- Very bandwidth-intensive
- Can’t check much data
- Need to keep the original!
Verification: existing algorithm
- Periodically, verify random blocks
- Compute function across the blocks (m/n coding)
- Alternative: verify keyed hash stored with the block
- Problems:
- Need to transfer entire block
- Taxes network with diagnostic data
- Peers often have asymmetric Internet connections
- Leaks information heavily
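The keyed-hash alternative listed above might look like the sketch below. HMAC-SHA256, the 8-byte block ID encoding, and the names `make_tag` and `verify` are assumptions chosen for illustration; the slides do not name a specific construction.

```python
import hashlib
import hmac

def make_tag(key: bytes, block_id: int, block: bytes) -> bytes:
    """Keyed hash (HMAC-SHA256) binding the block contents to its ID."""
    msg = block_id.to_bytes(8, "big") + block
    return hmac.new(key, msg, hashlib.sha256).digest()

def verify(key: bytes, block_id: int, block: bytes, tag: bytes) -> bool:
    """Recompute the tag over the full block and compare in constant time."""
    return hmac.compare_digest(make_tag(key, block_id, block), tag)

key = b"owner-secret"            # hypothetical key held only by the data owner
block = b"x" * 4096              # stand-in for a stored block
tag = make_tag(key, 17, block)   # stored remotely alongside block 17
```

Note that `verify` needs the entire block on the verifier's side, which is exactly the bandwidth problem this slide points out.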
Verification using algebraic signatures
- Solution: use checksums?
- Cryptographic checksums (like SHA-1) won’t work for randomly selected ranges
- Requires original data for comparison
- Our scheme
- Uses small challenges and responses
- Allows unpredictable tests
- Free rider can’t just store the answers to all possible challenges (at least not while saving any storage)
- Verifies that all remote chunks are consistent with each other
- Requires that parity is calculated with an XOR code, a linear m/n code, or a convolutional code
- Examples: X-code, EvenOdd, row-diagonal parity, linear codes over a Galois field
What is a Galois field?
- Simple answer:
- Calculations on a set of symbols
- A field called GF(2^n) uses n-bit symbols
- Two kinds of operations
- Addition (done by XOR)
- Multiplication (more complex, done by tables)
- Complex answer:
- Galois fields are math done using the coefficients of polynomial equations
- Often, coefficients are represented in base-2
- A Galois field using polynomials of degree less than n with base-2 coefficients is called GF(2^n)
- This answer explains how the addition and multiplication tables are generated
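The "simple answer" can be made concrete for GF(2^8), where symbols are bytes. The sketch below builds multiplication tables from powers of a primitive element; the reduction polynomial 0x11D and the primitive element 2 are assumptions (a common choice), not values taken from the slides.

```python
# Arithmetic in GF(2^8): addition is XOR, multiplication uses
# log/antilog tables built from powers of a primitive element.
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed reduction polynomial)

EXP = [0] * 510   # EXP[i] = alpha^i, doubled so EXP[a + b] needs no modulo
LOG = [0] * 256   # LOG[alpha^i] = i (LOG[0] is unused)
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1              # multiply by alpha = 2
    if x & 0x100:
        x ^= POLY        # reduce modulo the field polynomial
for i in range(255, 510):
    EXP[i] = EXP[i - 255]

def gf_add(a, b):
    """Addition (and subtraction) in GF(2^n) is bitwise XOR."""
    return a ^ b

def gf_mul(a, b):
    """Multiply via tables: alpha^i * alpha^j = alpha^(i + j)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]
```

The first 255 entries of `EXP` enumerate every non-zero element exactly once, which is what makes the element primitive.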
What is an algebraic signature?
- Digital hash with algebraic properties
- Important properties:
- Small changes in data result in complete change of signature
- Signature of parity is parity of signatures
Figure: data blocks D1, D2, D3, …, Dm and parity blocks P1, P2, P3, …, Pk
- (sig(D1), sig(D2), sig(D3), …, sig(Dm), sig(P1), sig(P2), sig(P3), …, sig(Pk)) is a codeword!
Algebraic signatures
- Defined over same Galois field as the linear m/n code
- Use “primitive” element α
- All non-zero elements are powers of α
- Consists of n coordinates
- Additional properties if α_i = α^i
- Coordinate signature defined by a formula (missing from this transcript)
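The defining formula did not survive in this transcript. For reference, the form usually given in the algebraic-signature literature is shown below; it is assumed, not confirmed, to match the original slide. Here D = (d_0, …, d_{l-1}) is the data as a sequence of field symbols, the sum is Galois-field addition (XOR), and the n-coordinate version uses α_i = α^i.

```latex
\[
  \operatorname{sig}_{\alpha}(D) \;=\; \sum_{i=0}^{l-1} d_i\,\alpha^{i},
  \qquad
  \operatorname{sig}_{\alpha,n}(D) \;=\;
  \bigl(\operatorname{sig}_{\alpha}(D),\ \operatorname{sig}_{\alpha^{2}}(D),\ \dots,\ \operatorname{sig}_{\alpha^{n}}(D)\bigr)
\]
```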
Algebraic signatures
- Algebraic properties
- Assume that X and Y are large data objects:
- sig(X⊕Y) = sig(X) ⊕ sig(Y)
- sig(β⋅X) = β⋅sig(X)
- Multiplication is in the Galois field of the signature calculation
- Signatures and parity formation commute
- Signatures can be updated from the old signature and the signature of the delta (XOR) between old and new data
- Signature calculation is fast!
- Hundreds of megabytes per second on a modern CPU
- Speed limited by disk bandwidth
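The algebraic properties listed above can be checked with a small single-coordinate signature over GF(2^8). This is a sketch under assumptions: the polynomial 0x11D, α = 2, and the signature form sig(D) = XOR-sum of d_i·α^i are illustrative choices, and a real deployment would use more coordinates.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements bit by bit (no tables)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def sig(data, alpha=2):
    """Single-coordinate signature: XOR-sum of data[i] * alpha^i."""
    s, power = 0, 1
    for byte in data:
        s ^= gf_mul(byte, power)
        power = gf_mul(power, alpha)
    return s

def xor(x, y):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))

X = b"remote storage block"
Y = b"another data block!!"
beta = 3
scaled_X = bytes(gf_mul(beta, b) for b in X)   # beta * X, symbol by symbol
```

With these definitions, `sig(xor(X, Y)) == sig(X) ^ sig(Y)` and `sig(scaled_X) == gf_mul(beta, sig(X))`; the update rule follows from the first identity applied to the delta between old and new data.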
Our algorithm
- Store data across distributed system
- Challenge sites to prove that they hold the data
- Sites respond with the signatures of requested data
- Sites reveal tiny amount of information: size of signature
- Challenger verifies that the signatures are consistent
Figure: data blocks D1, D2, D3 and parity block P stored on separate sites
- Example challenge: calculate signature of 32-byte ranges at 4+i×71, i = 5,…,20
- Responses: sig1, sig2, sig3, sigP
- Check: sig1 ⊕ sig2 ⊕ sig3 ≟ sigP
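The challenge and check from this slide can be sketched end to end with XOR parity. Assumptions in this sketch: a one-byte signature over GF(2^8) (polynomial 0x11D, α = 2), one signature computed over the concatenation of the challenged ranges, and hypothetical helper names `respond` and `consistent`.

```python
import random

def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements bit by bit."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def sig(data, alpha=2):
    """Single-coordinate signature: XOR-sum of data[i] * alpha^i."""
    s, power = 0, 1
    for byte in data:
        s ^= gf_mul(byte, power)
        power = gf_mul(power, alpha)
    return s

def respond(block, offsets, length):
    """Run by a storage site: signature of the challenged byte ranges."""
    return sig(b"".join(block[o:o + length] for o in offsets))

def consistent(data_sigs, parity_sig):
    """Run by the challenger: does the XOR of data signatures match parity?"""
    acc = 0
    for s in data_sigs:
        acc ^= s
    return acc == parity_sig

rng = random.Random(2006)
d1, d2, d3 = (bytes(rng.randrange(256) for _ in range(2048)) for _ in range(3))
parity = bytes(a ^ b ^ c for a, b, c in zip(d1, d2, d3))

offsets = [4 + i * 71 for i in range(5, 21)]   # the challenge from the slide
sigs = [respond(b, offsets, 32) for b in (d1, d2, d3)]
parity_sig = respond(parity, offsets, 32)
honest = consistent(sigs, parity_sig)

bad = bytearray(d2)
bad[offsets[0]] ^= 0xFF                        # site 2 silently alters one byte
cheat = consistent([sigs[0], respond(bytes(bad), offsets, 32), sigs[2]], parity_sig)
```

The challenger never holds the data: it only sends the offsets and compares a few bytes of response. Here `honest` is true and `cheat` is false, because a single altered symbol always changes the signature.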
Collusion protection
Figure: data blocks D1, D2, D3 with parity blocks P1, P2
- Need for collusion protection
- All data and parity storing sites could collude to undetectably change the contents of data (or just throw it away)
- Return signatures that are internally consistent
- Modify scheme to prevent collusion by:
- Using random m/n linear codes to generate parity blocks
- Blinding data or parity by XORing with pseudo-random stream
- Stream cipher seeded with block ID and secret known only to data owner
Blinding with a random stream
Figure: Data XORed with the output (Out) of a pseudo-random stream generator that takes Seed1 and Seed2
- XOR data stream with a known pseudo-random stream
- Algebraic signature of blinded data is the XOR of the signature of the original data and of the blinding stream
- Data owner need only keep the seed of the pseudo-random stream
- Reconstruct signature by recalculating stream
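The blinding property above follows directly from the linearity of the signature. In the sketch below the pseudo-random stream is SHA-256 in counter mode, keyed with a secret and the block ID; that generator, the names `keystream` and `blind`, and the GF(2^8) parameters are all assumptions standing in for the stream cipher mentioned in the talk.

```python
import hashlib

def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements bit by bit."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def sig(data, alpha=2):
    """Single-coordinate signature: XOR-sum of data[i] * alpha^i."""
    s, power = 0, 1
    for byte in data:
        s ^= gf_mul(byte, power)
        power = gf_mul(power, alpha)
    return s

def keystream(secret: bytes, block_id: int, length: int) -> bytes:
    """Pseudo-random stream derived from the owner's secret and the block ID."""
    out, counter = b"", 0
    while len(out) < length:
        seed = secret + block_id.to_bytes(8, "big") + counter.to_bytes(8, "big")
        out += hashlib.sha256(seed).digest()
        counter += 1
    return out[:length]

def blind(data: bytes, secret: bytes, block_id: int) -> bytes:
    """XOR the data with the stream; applying it twice removes the blinding."""
    stream = keystream(secret, block_id, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))

secret = b"known only to the data owner"   # hypothetical seed material
data = b"parity block contents " * 8
blinded = blind(data, secret, block_id=5)
stream_sig = sig(keystream(secret, 5, len(data)))   # recomputable from the seed
```

The owner can compare `sig(blinded)` against `sig(data) ^ stream_sig` while keeping nothing but the seed for the stream.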
Blinding with random stream
- Prevent colluding sites from discovering parity calculation scheme by XORing blocks with pseudo-random data
- Several possibilities:
- Blind only parity
- Blind only data after calculating parity
- Calculate parity, then blind data and parity
- Blinding must be done after parity calculation
- Necessary to ensure that storage servers can’t solve for the generator matrix
- Storage servers are missing both the pseudo-random stream and the generator matrix
- Blinding doesn’t prevent data recovery!
- Example: blinding just the parity means data can be read
Generating random m/n codes
- Linear m/n codes are defined over Galois field with 2^f elements
- Galois fields have addition, subtraction, multiplication, division, 0, 1
- Same rules in Galois fields as for real, rational, complex, … numbers
- Code defined by a generator matrix G with m rows and n columns
- G has special form ( I_m | P )
- I_m is the m×m identity matrix
- Every m×m submatrix of G is invertible
Generating random m/n codes
Figure: Vandermonde matrix transformed into generator matrix
- Generation of generator matrix with n = 2^f:
- Start with all 2^f GF elements in a given order: a_1, a_2, … a_n
- Order can be changed for a different code
- Form Vandermonde matrix
- Use Gaussian algorithm to transfer to desired form
Generating random m/n codes
- Given data bytes d1, d2, … dm, calculate parity bytes by:
(d1, d2, … dm)⋅G = (d1, d2, … dm, p1, p2, … , pk)
- Generate random code by starting with a random permutation of the GF elements
- Pick a random code using a seed the data owner keeps
- If all sites collude:
- G can be reconstructed from known encoding
- This can be prevented if the sites don’t know the correct values of dx or px for the calculation
- Blinding ensures they don’t!
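The construction described on the last three slides (seeded permutation of the field elements, Vandermonde matrix, Gaussian elimination to the form ( I_m | P ), then encoding) can be sketched in a small field. GF(2^4) with polynomial x^4 + x + 1 is an assumption chosen to keep the example short, and the function names are hypothetical.

```python
import random

F = 4            # field GF(2^F), so n = 2^F = 16 code symbols
POLY = 0x13      # x^4 + x + 1 (assumed reduction polynomial)

def gf_mul(a, b):
    """Multiply two GF(2^F) elements bit by bit."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << F):
            a ^= POLY
        b >>= 1
    return r

def gf_pow(a, e):
    """a raised to the e-th power by repeated multiplication (a^0 = 1)."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    """Inverse of a non-zero element: a^(2^F - 2)."""
    return gf_pow(a, (1 << F) - 2)

def generator_matrix(m, seed):
    """Seeded permutation -> Vandermonde rows -> Gauss-Jordan to (I_m | P)."""
    elems = list(range(1 << F))
    random.Random(seed).shuffle(elems)           # the owner's secret ordering
    G = [[gf_pow(a, i) for a in elems] for i in range(m)]
    for c in range(m):
        pivot = next(r for r in range(c, m) if G[r][c] != 0)
        G[c], G[pivot] = G[pivot], G[c]
        inv = gf_inv(G[c][c])
        G[c] = [gf_mul(inv, v) for v in G[c]]    # scale pivot row to 1
        for r in range(m):
            if r != c and G[r][c]:
                factor = G[r][c]
                G[r] = [v ^ gf_mul(factor, w) for v, w in zip(G[r], G[c])]
    return G

def encode(data, G):
    """Row vector times G: yields (d1..dm, p1..pk)."""
    out = []
    for j in range(len(G[0])):
        acc = 0
        for d, row in zip(data, G):
            acc ^= gf_mul(d, row[j])
        out.append(acc)
    return out

G = generator_matrix(4, seed=7)
codeword = encode([1, 2, 3, 4], G)
```

The first m symbols of `codeword` are the data itself, and the remaining k = n - m symbols are parity. A different seed permutes the field elements differently, which is how the owner selects a code the storage sites cannot predict.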
Securing a single site
- Use our scheme to secure data stored on a single site
- Can place all data and one or two parity chunks on a single site
- Single site cannot undetectably alter the contents
- Storage overhead can be made arbitrarily small
- Anyone knowing the erasure coding and the blinding scheme can now check that the data is stored accurately
- May use any randomly-selected slice of the data
- Storage site can’t keep only data block hashes
Example
- Create a backup peer-to-peer system
- Business model
- Clients
- Register with central site
- Clients pay annual registration fee
- Clients promise
- to store 10 GB + ~2% on their computer
- to have their machine connected to the Internet at least 20 hours every 24 hours
- Enterprise promises
- To establish an addressing scheme that allows clients to store data remotely
- To verify that clients fulfill their promises
Example
- Use of our technique:
- Enterprise can verify that the data is stored correctly
- Client might encrypt data to prevent anyone else from rebuilding it
- May use any desired encryption algorithm, as long as the parity is calculated after encryption
- Blinding still has to be done...
- Signatures don’t leak much data anyway...
- Alternative
- Client that stores data also prepares a number of signature challenges
- Client gives those to the enterprise
- Enterprise uses the challenges, but might run out of them
Future work
- Implementing a peer-to-peer backup system using this approach
- Includes techniques from prior peer-to-peer backup systems
- Adds low-overhead verification
- Using this technique in POTSHARDS to verify that remote servers are maintaining data
- Limited leakage is critical for this application
- Use this technique for storage service providers?
Conclusions
- We developed a scheme that can verify data stored on storage sites outside of our administrative control
- Small challenges and responses
- Good for network load
- Limits information leakage to a negligible amount
- Sufficient variety to force potential free-loaders to store the data, not potential answers
- Secure against collusion
- Basic scheme for storage schemes that use parity or m/n coding for high availability
- Extension to storing data on a single site
- Incurs arbitrarily small overhead from storing additional parity data
Questions?
More information on the Web at http://www.ssrc.ucsc.edu/proj/archive.html
Thanks to SSRC faculty, students, and sponsors!