
Secure Indexing/Search for Regulatory-Compliant Record Retention - PowerPoint PPT Presentation



  1. Secure Indexing/Search for Regulatory-Compliant Record Retention

  2. There is a need for trustworthy record keeping. Records: email, instant messaging, files. Corporate misconduct. Digital information explosion: IDC forecasts 60B business emails annually. Soaring discovery costs: spending on eDiscovery is growing at 65% CAGR, and the average F500 company has 125 non-frivolous lawsuits at any given time. Focus on compliance (e.g. HIPAA). Sources: IDC, Network World (2003), Socha / Gelbmann (2004). Reference: Q. Zhu, W. W. Hsu: Fossilized Index: The Linchpin of Trustworthy Non-Alterable Electronic Records. SIGMOD 2006, 395-406, 2006

  3. What is trustworthy record keeping? Establish solid proof of events that have occurred. (Diagram: along a time axis, Alice commits a record to the storage device, the record is later regretted, and Bob issues a query; an adversary may act in between.) Bob should get back Alice's data.

  4. This leads to a unique threat model. Commit is trustworthy: the record is created properly. Query is trustworthy: the record is queried properly. In between, the adversary has super-user privileges: access to the storage device and access to any keys. The adversary could be Alice herself.

  5. Traditional schemes do not work: we cannot rely on Alice's signature, because the adversary has access to any keys and could be Alice herself.

  6. WORM storage helps address the problem. Write Once Read Many (WORM) storage accepts new records but rejects overwrite and delete, so the adversary cannot delete Alice's record.

  7. WORM storage helps address the problem. Write Once Read Many (WORM) storage accepts new records but rejects overwrite and delete, so the adversary cannot delete Alice's record. It is built on top of conventional rewritable magnetic disk, with write-once semantics enforced through software, with file modification and premature deletion operations disallowed.
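
The write-once semantics described on this slide can be illustrated with a small in-memory sketch. It is only an illustration under stated assumptions: the names `WormStore` and `WormViolation`, the fixed retention period, and the injectable clock are all hypothetical and do not come from the slides or from any real WORM product.

```python
import time


class WormViolation(Exception):
    """Raised when an operation would break write-once semantics."""


class WormStore:
    """Hypothetical software-enforced WORM layer over a rewritable store."""

    def __init__(self, retention_seconds, clock=time.time):
        self._records = {}  # name -> (data, commit time)
        self._retention = retention_seconds
        self._clock = clock  # injectable so the sketch can be tested

    def write(self, name, data):
        # A new record is accepted; modifying an existing one is not.
        if name in self._records:
            raise WormViolation(f"{name!r} is already committed")
        self._records[name] = (bytes(data), self._clock())

    def read(self, name):
        return self._records[name][0]

    def delete(self, name):
        # Only premature deletion is disallowed.
        _, committed_at = self._records[name]
        if self._clock() - committed_at < self._retention:
            raise WormViolation(f"{name!r} is still under retention")
        del self._records[name]
```

In this sketch a second `write` to the same name, or a `delete` before the retention period has elapsed, raises `WormViolation`; reads are always allowed.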

  8. Index required due to high volume of records. (Diagram: along the time axis, Alice commits a record and the index is updated; after the regret point the adversary may act; Bob's query goes through the index.)

  9. In effect, records can be hidden or altered by modifying the index: hide record B from the index, or replace B with B'. The index must also be secured (fossilized).

  10. A B-tree for an increasing sequence can be created on WORM. (Figure: example tree over the keys 2, 4, 7, 11, 13, 19, 23, 29, 31.)

  11. B+tree index is insecure, even on WORM. The path to an element depends on elements inserted later, so an adversary can attack it. (Figure: the same tree with additional entries 25, 26, 27, 30.)

  12. Is this a real threat? Would someone want to delete a record a day after it is created? Intrusion detection logging: once an adversary gains control, he would like to delete the records of his initial attack. A record may be regretted moments after creation. Email best practice: a message must be committed before it is delivered.

  13. Several levels of indexing. Keywords map to posting lists of document ids: Query: 1, 3, 11, 17; Data: 3, 9; Base: 3, 19; Worm: 7, 36; Index: 3. To find documents containing the keywords "Query" and "Data" and "Base", retrieve the lists for Query, Data and Base, and intersect the document ids in the lists.
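
The conjunctive keyword lookup on this slide can be sketched as a merge-style intersection of sorted posting lists. The posting lists below are the ones shown on the slide; the function names `intersect` and `search` are hypothetical, and the shortest-list-first ordering is a common optimization rather than something the slides state.

```python
# Posting lists from the slide: keyword -> sorted document ids.
postings = {
    "Query": [1, 3, 11, 17],
    "Data": [3, 9],
    "Base": [3, 19],
    "Worm": [7, 36],
    "Index": [3],
}


def intersect(a, b):
    """Intersect two sorted lists of document ids with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(postings, keywords):
    """Return ids of documents that contain every keyword."""
    lists = sorted((postings.get(k, []) for k in keywords), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


print(search(postings, ["Query", "Data", "Base"]))  # prints [3]
```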

  14. GHT: A Generalized Hash Tree (Fossilized Index). The tree grows from the root down to the leaves without relocating committed entries. It is "balanced" without requiring dynamic adjustments to its structure. For hash-based schemes, it is a dynamic hashing scheme that does not require rehashing.

  15. GHT: defined by {M, K, H}. M = {m_0, m_1, …}, where m_i is the size of a tree node (number of buckets) at level i. K = {k_0, k_1, …}, where k_i is the growth factor for level i: a tree has k_i times as many nodes at level (i+1) as at level i. H = {h_0, h_1, …}, where h_i is a hash function for level i. Example: m_0 = m_1 = … = 4, k_0 = k_1 = … = 2. Different H values lead to different GHT variants.
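
As a small worked example of the M and K parameters, the following sketch computes how many nodes and buckets each level holds. The helper name `level_sizes` is hypothetical; the parameter values are the ones on the slide.

```python
def level_sizes(m, k, levels):
    """Return (nodes, buckets) for levels 0..levels-1.

    m[i] is the number of buckets per node at level i and k[i] is the
    growth factor from level i to level i + 1, as defined on the slide.
    """
    sizes = []
    nodes = 1  # a single root node at level 0
    for i in range(levels):
        sizes.append((nodes, nodes * m[i]))
        nodes *= k[i]
    return sizes


# Slide parameters: m_0 = m_1 = ... = 4, k_0 = k_1 = ... = 2.
print(level_sizes([4] * 4, [2] * 4, 4))
# prints [(1, 4), (2, 8), (4, 16), (8, 32)]
```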

  16. Standard (Default) GHT: Thin Tree. Defined by {M, K, H}. M = {m_0, m_1, …}, where m_i is the size of a tree node (number of buckets) at level i. K = {k_0, k_1, …}, where k_i is the growth factor for level i: a tree has k_i times as many nodes at level (i+1) as at level i. H = {h_0, h_1, …}, where h_i is a hash function for level i. Example: m_0 = m_1 = … = 4, k_0 = k_1 = … = 2. (Figure: tree levels labeled h_0, h_1, h_2.)

  17. Standard (Default) GHT: Thin Tree. Defined by {M, K, H}. M = {m_0, m_1, …}, where m_i is the size of a tree node (number of buckets) at level i. K = {k_0, k_1, …}, where k_i is the growth factor for level i: a tree has k_i times as many nodes at level (i+1) as at level i. H = {h_0, h_1, …}, where h_i is a hash function for level i. Example: m_0 = m_1 = … = 4, k_0 = k_1 = … = 2, h_0 = x mod 4, h_1 = x mod 8. What about h_2? x mod 16?

  18. Standard (Default) GHT: Thin Tree. Defined by {M, K, H}. M = {m_0, m_1, …}, where m_i is the size of a tree node (number of buckets) at level i. K = {k_0, k_1, …}, where k_i is the growth factor for level i: a tree has k_i times as many nodes at level (i+1) as at level i. H = {h_0, h_1, …}, where h_i is a hash function for level i. Example: m_0 = m_1 = … = 4, k_0 = k_1 = … = 2, h_0 = x mod 4, h_1 = x mod 8, h_2 = h_3 = … = x mod 8 (each node has 2 children of 4 buckets each, so there are only 8 candidate buckets below it).

  19. GHT Variant (Fat Tree). Example: m_0 = m_1 = … = 4, k_0 = k_1 = … = 2, h_0 = x mod 4, h_1 = x mod 8, h_2 = x mod 16, in general h_i = x mod (4·2^i). Can tolerate non-ideal hash functions better because there are many more potential target buckets at each level. Hashing at different levels is independent. Can allocate different levels to different disks and access them in parallel. Expensive to maintain children pointers in each node: the number of pointers grows exponentially.
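
A minimal sketch of fat-tree insertion, assuming the slide's parameters and its illustrative hash h_i = x mod (4·2^i). To keep it short, each level is modeled as one flat array of buckets, which sidesteps the children pointers the slide mentions; that simplification, and the names `fat_hash` and `FatGHT`, are assumptions of this sketch, not the authors' layout.

```python
M0 = 4  # buckets in the root node (m_0 on the slide)
K = 2   # growth factor (k_i on the slide)


def fat_hash(x, level):
    """Slide's illustrative hash: h_i(x) = x mod (4 * 2**i)."""
    return x % (M0 * K ** level)


class FatGHT:
    def __init__(self):
        # levels[i] is a flat list of M0 * K**i buckets (an assumption).
        self.levels = []

    def insert(self, x):
        """Place x in the first level whose target bucket is still free."""
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append([None] * (M0 * K ** level))
            slot = fat_hash(x, level)
            if self.levels[level][slot] is None:
                self.levels[level][slot] = x  # written once, never moved
                return (level, slot)
            level += 1  # collision: try the next level down


tree = FatGHT()
print([tree.insert(x) for x in (1, 5, 9, 13)])
# prints [(0, 1), (1, 5), (1, 1), (2, 13)]
```

Because every level hashes over all of its buckets, keys that collide at one level (1, 5, 9, 13 all map to bucket 1 at level 0) can spread out at the next.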

  20. GHT (Standard) Insertion. Bucket = (Level, Child: left or right, Entry within bucket). Example buckets: (0, 0, 1), (1, 1, 2), (2, 0, 1).

  21. GHT Insertion. Insert a key whose hash values at the various levels are: h0(key) = 1, h1(key) = 6, h2(key) = 1, h3(key) = 3. Buckets (0, 0, 1), (1, 1, 2) and (2, 0, 1) are occupied (collision).

  22. GHT Insertion. Insert a key whose hash values at the various levels are: h0(key) = 1, h1(key) = 6, h2(key) = 1, h3(key) = 3. Buckets (0, 0, 1), (1, 1, 2) and (2, 0, 1) are occupied (collision), so the key is stored at (3, 0, 3). If the hash functions are uniform, the tree grows top-down in a balanced fashion.

  23. GHT Search. Search for a key whose hash values at the various levels are: h0(key) = 1, h1(key) = 6, h2(key) = 1, h3(key) = 3, following buckets (0, 0, 1), (1, 1, 2), (2, 0, 1), (3, 0, 3). Similar to insertion. Need to deal with duplicate key values. Only for point queries; cannot support range search.
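
The insertion and search walks on the last few slides can be sketched for the standard (thin) tree as follows. This is a sketch under assumptions, not the authors' implementation: the class names, the in-memory node layout, and the salted SHA-256 default hash are hypothetical; only the parameters (4 buckets per node, 2 children) and the (level, child, entry) bucket coordinates come from the slides.

```python
import hashlib

M = 4  # buckets per node (m_i = 4 on the slides)
K = 2  # children per node (k_i = 2 on the slides)


def level_hash(key, level):
    """Assumed h_level: salted SHA-256 reduced to that level's bucket range."""
    digest = hashlib.sha256(f"{level}:{key}".encode()).digest()
    span = M if level == 0 else K * M  # 4 at the root, 8 below any node
    return int.from_bytes(digest[:8], "big") % span


class Node:
    def __init__(self):
        self.buckets = [None] * M   # each bucket is written at most once
        self.children = [None] * K  # children are only ever added


class ThinGHT:
    def __init__(self, hash_fn=level_hash):
        self.root = Node()
        self.hash_fn = hash_fn

    def insert(self, key, value):
        """Store (key, value) in the first free bucket on the key's path."""
        node, level = self.root, 0
        while True:
            child, entry = divmod(self.hash_fn(key, level), M)
            if level > 0:
                if node.children[child] is None:
                    node.children[child] = Node()
                node = node.children[child]
            if node.buckets[entry] is None:
                node.buckets[entry] = (key, value)
                return (level, child, entry)
            level += 1  # occupied: descend one level

    def search(self, key):
        """Follow the same path; collect every match to handle duplicates."""
        node, level, found = self.root, 0, []
        while True:
            child, entry = divmod(self.hash_fn(key, level), M)
            if level > 0:
                node = node.children[child]
                if node is None:
                    return found
            slot = node.buckets[entry]
            if slot is None:
                return found
            if slot[0] == key:
                found.append(slot[1])
            level += 1


# Replay the slide's hash values: h0 = 1, h1 = 6, h2 = 1, h3 = 3.
slide_hashes = {0: 1, 1: 6, 2: 1, 3: 3}
tree = ThinGHT(hash_fn=lambda key, level: slide_hashes.get(level, 0))
print([tree.insert("key", f"v{i}") for i in range(4)])
# prints [(0, 0, 1), (1, 1, 2), (2, 0, 1), (3, 0, 3)]
print(tree.search("key"))  # prints ['v0', 'v1', 'v2', 'v3']
```

Inserting the same key four times is only a device to occupy the first three buckets on the path; on the slide those buckets presumably hold other keys. The search keeps walking past a match until it reaches an empty bucket or a missing child, which is one simple way to return duplicate key values.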

  24. Summary. Trustworthy record keeping is important. However, efficient retrieval must also be ensured. Existing indexing structures may be manipulated. GHT is a "trustworthy" index structure: once a record is committed, it cannot be manipulated.

  25. Most business records are unstructured, searched by inverted index. Keywords map to posting lists: Query: 1, 3, 11, 17; Data: 3, 9; Base: 3, 19; Worm: 7, 36; Index: 3. One WORM file for each posting list. Reference: S. Mitra, W. W. Hsu, M. Winslett: Trustworthy Keyword Search for Regulatory-Compliant Record Retention. VLDB 2006, 1001-1012, 2006

  26. Index must be updated as new documents arrive. Example: Doc 79 contains Query, Data and Index, so 79 is appended to each of those posting lists: Query: 1, 3, 11, 17, 79; Data: 3, 9, 79; Base: 3, 19; Worm: 7, 36; Index: 3, 79. 500 keywords = 500 disk seeks, ~1 sec per document.

  27. Amortize cost by updating in batch. Documents 79 to 83 are held in a buffer; Query occurs in docs 79, 81 and 83, so all three ids are appended to Query's posting list (1, 3, 11, 17) together. 1 seek per keyword in batch. Large buffer to benefit infrequent terms. Over 100,000 documents to achieve 2 docs/sec.
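
The batching idea can be sketched by counting one simulated seek per posting-list append. The class name `BatchedIndex` and the `seeks` counter are hypothetical, and apart from Query occurring in docs 79, 81 and 83, the document contents in the usage below are made up for illustration.

```python
from collections import defaultdict


class BatchedIndex:
    """Hypothetical inverted index that appends to posting lists in batches."""

    def __init__(self, batch_size):
        self.postings = defaultdict(list)  # keyword -> list of doc ids
        self.batch_size = batch_size
        self.buffer = []                   # documents not yet indexed
        self.seeks = 0                     # one simulated seek per append

    def add(self, doc_id, keywords):
        self.buffer.append((doc_id, set(keywords)))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Group buffered doc ids by keyword, then touch each list once.
        pending = defaultdict(list)
        for doc_id, keywords in self.buffer:
            for keyword in keywords:
                pending[keyword].append(doc_id)
        for keyword, doc_ids in pending.items():
            self.postings[keyword].extend(doc_ids)
            self.seeks += 1
        self.buffer.clear()


docs = [(79, ["Query", "Data"]), (80, ["Base"]), (81, ["Query"]),
        (82, ["Worm"]), (83, ["Query"])]

batched = BatchedIndex(batch_size=5)
unbatched = BatchedIndex(batch_size=1)
for doc_id, keywords in docs:
    batched.add(doc_id, keywords)
    unbatched.add(doc_id, keywords)

print(batched.postings["Query"], batched.seeks, unbatched.seeks)
# prints [79, 81, 83] 4 6
```

The saving comes only from keywords that repeat within a batch, which is why the slide notes that a large buffer is needed before infrequent terms benefit.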

  28. Index is not updated immediately. Between Alice's commit and the index update, the record's entries sit in the buffer, where the adversary can alter or omit them. Prevailing practice: email must be committed before it is delivered.
