SLIDE 1

Reducing Replication Bandwidth for Distributed Document Databases

Lianghong Xu¹, Andy Pavlo¹, Sudipta Sengupta², Jin Li², Greg Ganger¹
¹Carnegie Mellon University  ²Microsoft Research

SLIDE 2

Document-oriented Databases

{ "_id" : "55ca4cf7bad4f75b8eb5c25c", "pageId" : "46780", "revId" : "41173", "timestamp" : "2002-03-30T20:06:22", "sha1" : "6i81h1zt22u1w4sfxoofyzmxd” "text" : “The Peer and the Peri is a comic [[Gilbert and Sullivan]] [[operetta ]] in two acts… just as predicting,…The fairy Queen, however, appears to … all live happily ever after. " } { "_id" : "55ca4cf7bad4f75b8eb5c25d”, "pageId" : "46780", "revId" : "128520", "timestamp" : "2002-03-30T20:11:12", "sha1" : "q08x58kbjmyljj4bow3e903uz” "text" : "The Peer and the Peri is a comic [[Gilbert and Sullivan]] [[operetta ]] in two acts… just as predicted, …The fairy Queen, on the other hand, is ''not'' happy, and appears to … all live happily ever after. " }

Update: reading a recent document and writing back a similar one.
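To make the pattern concrete, here is a minimal sketch of that read-then-write update, assuming a MongoDB collection accessed through pymongo; the connection details and collection names are illustrative, not from the slides:

    from pymongo import MongoClient

    # Connection details, database, and collection names are illustrative.
    client = MongoClient("mongodb://localhost:27017")
    revisions = client["wiki"]["revisions"]

    # Read the most recent revision of the page...
    latest = revisions.find_one({"pageId": "46780"}, sort=[("timestamp", -1)])

    # ...and write back a nearly identical document: only revId, timestamp,
    # and a few words of text differ (recomputing sha1 omitted for brevity).
    new_rev = dict(latest)
    new_rev.pop("_id")                    # let MongoDB assign a fresh _id
    new_rev["revId"] = "128520"
    new_rev["timestamp"] = "2002-03-30T20:11:12"
    new_rev["text"] = latest["text"].replace("just as predicting", "just as predicted")
    revisions.insert_one(new_rev)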

SLIDE 3

Replication Bandwidth

{ "_id" : "55ca4cf7bad4f75b8eb5c25c", "pageId" : "46780", "revId" : "41173", "timestamp" : "2002-­‑03-­‑30T20:06:22Z", "sha1" : "6i81h1zt22u1w4sfxoofyzmxd” "text" : "The Peer and the Peri” is a comic [[Gilbert and Sullivan]] [[operetta ]] in two acts… just as predicting,…The fairy Queen, however, appears to … all live happily ever after. " } { "_id" : "55ca4cf7bad4f75b8eb5c25d”, "pageId" : "46780", "revId" : "128520", "timestamp" : "2002-03-30T20:11:12Z", "sha1" : "q08x58kbjmyljj4bow3e903uz” "text" : "The Peer and the Peri” is a comic [[Gilbert and Sullivan]] [[operetta ]] in two acts… just as predicted, …The fairy Queen, on the other hand, is ''not'' happy, and appears to … all live happily ever after. " }

Operation logs Operation logs

Secondary Secondary

WAN

Primary Database

Goal: Reduce bandwidth for WAN geo-replication

SLIDE 4

Why Deduplication?

  • Why not just compress?

– Oplog batches are small and contain too little internal overlap

  • Why not just use diff?

– Diff needs application guidance to identify the source document

  • Dedup finds and removes redundancies

– In the entire data corpus

SLIDE 5

Traditional Dedup: Ideal

[Diagram: an incoming byte stream is split at chunk boundaries into chunks 1-5. In the ideal case the modified region falls exactly on chunk 3, so chunks 1, 2, 4, and 5 are duplicates and only chunk 3 is new.]

Send dedup’ed data to replicas

SLIDE 6

Traditional Dedup: Reality

[Diagram: in reality, chunk boundaries rarely align with the modified region; the edit shifts the boundaries of the incoming document, so most chunks are new and only chunk 4 deduplicates.]

Send almost the entire document.
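The mechanism behind both slides can be sketched in a few lines: content-defined chunking with a toy rolling hash (standing in for Rabin fingerprinting) and an index of chunk hashes. This is an illustrative sketch, not the paper's implementation; with document-sized inputs, an edit rarely lands neatly on a chunk boundary, so few chunks hit the index:

    import hashlib

    def chunk(data: bytes, mask: int = 0xFF, min_size: int = 64) -> list[bytes]:
        """Content-defined chunking: cut where the low bits of a toy
        rolling hash are zero (average chunk size ~256B with mask 0xFF)."""
        chunks, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + b) & 0xFFFFFFFF
            if (h & mask) == 0 and i + 1 - start >= min_size:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    chunk_index: set[bytes] = set()   # digests of chunks the replica already has

    def dedup(doc: bytes) -> list[bytes]:
        """Return only the chunks that must be sent to the replica."""
        unique = []
        for c in chunk(doc):
            digest = hashlib.sha1(c).digest()
            if digest not in chunk_index:
                chunk_index.add(digest)
                unique.append(c)
        return unique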

SLIDE 7

Similarity Dedup

[Diagram: similarity dedup matches the incoming document against a similar source document as a whole and delta-encodes it, so only a small delta covering the modified region remains.]

Only send delta encoding.

SLIDE 8

Compress vs. Dedup

[Chart: compression vs. dedup. 20GB sampled Wikipedia dataset; MongoDB v2.7; 4MB oplog batches.]

SLIDE 9

sDedup: Similarity Dedup

[Diagram: on the primary node, client insertions and updates go to the database and its oplog; the sDedup encoder deduplicates unsynchronized oplog entries against source documents. The dedup'ed oplog entries travel to the secondary node, where the sDedup decoder reconstructs them using its own source documents plus a source document cache, and the oplog syncer replays the re-constructed oplog entries into the database.]

SLIDE 10

sDedup Encoding Steps

  • Identify Similar Documents
  • Select the Best Match
  • Delta Compression
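Composed as code, the encoder pipeline might look like the following sketch; all function names (find_similar_docs, select_best_match, delta_encode, load_doc) and the cache variable are hypothetical, and each step is sketched on the following slides:

    def sdedup_encode(target: bytes):
        """Hypothetical composition of the three encoding steps."""
        candidates = find_similar_docs(target)            # Step 1: similarity sketch lookup
        source_id = select_best_match(candidates, cache)  # Step 2: ranking with cache reward
        if source_id is None:
            return ("literal", target)                    # nothing similar: ship whole doc
        return ("delta", source_id, delta_encode(load_doc(source_id), target))  # Step 3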

SLIDE 11

Identify Similar Documents

[Diagram: the target document is split with Rabin chunking; consistent sampling of the chunk hashes (here, the top two values, 41 and 32) forms the similarity sketch. Each sketch feature is looked up in the feature index table to find candidate documents. A candidate's similarity score is the number of sketch features it shares with the target: Doc #1 scores 1, Doc #2 scores 2, Doc #3 scores 2.]
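A minimal sketch of this step, following the slide: keep the K largest chunk hashes as features (consistent sampling) and score candidates by shared features. It reuses chunk() from the traditional-dedup sketch earlier; K and the index layout are illustrative assumptions:

    from collections import defaultdict
    import hashlib

    K = 4   # sketch size: number of sampled features (value assumed)
    feature_index: dict[int, set[str]] = defaultdict(set)   # feature -> doc ids

    def sketch(doc: bytes) -> list[int]:
        """Consistent sampling: the K largest chunk-hash values are features."""
        hashes = [int.from_bytes(hashlib.sha1(c).digest()[:8], "big")
                  for c in chunk(doc)]
        return sorted(hashes, reverse=True)[:K]

    def index_doc(doc_id: str, doc: bytes) -> None:
        for f in sketch(doc):
            feature_index[f].add(doc_id)

    def find_similar_docs(doc: bytes) -> dict[str, int]:
        """Similarity score = number of sketch features shared with the target."""
        scores: dict[str, int] = defaultdict(int)
        for f in sketch(doc):
            for doc_id in feature_index.get(f, ()):
                scores[doc_id] += 1
        return scores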

SLIDE 12

Select the Best Match

Initial ranking (by similarity score):

    Rank  Candidate  Score
    1     Doc #2     2
    1     Doc #3     2
    2     Doc #1     1

Is the candidate in the source document cache? If yes, reward +2.

Final ranking:

    Rank  Candidate  Cached?  Score
    1     Doc #3     Yes      4
    2     Doc #1     Yes      3
    3     Doc #2     No       2
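A minimal sketch of the selection rule, matching the +2 cache reward on the slide (the function signature and cache representation are assumptions):

    CACHE_REWARD = 2   # bonus for candidates already in the source document cache

    def select_best_match(scores: dict[str, int], cache: set[str]):
        """Pick the candidate with the highest rewarded score, or None."""
        if not scores:
            return None
        return max(scores,
                   key=lambda d: scores[d] + (CACHE_REWARD if d in cache else 0))

Preferring cached candidates avoids an extra fetch of the source document on the decoder side even when a slightly higher-scoring uncached candidate exists.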

SLIDE 13

Evaluation

  • MongoDB setup (v2.7)

– 1 primary node, 1 secondary node, 1 client
– Node config: 4 cores, 8GB RAM, 100GB HDD storage

  • Datasets:

– Wikipedia dump (20GB out of ~12TB)
– Additional datasets evaluated in the paper

SLIDE 14

Compression

Compression ratio by chunk size (20GB sampled Wikipedia dataset):

    Chunk Size  sDedup  trad-dedup
    4KB          9.9     2.3
    1KB         26.3     4.6
    256B        38.4     9.1
    64B         38.9    15.2

SLIDE 15

Memory

Memory footprint by chunk size (20GB sampled Wikipedia dataset):

    Chunk Size  sDedup (MB)  trad-dedup (MB)
    4KB          34.1          80.2
    1KB          47.9         133.0
    256B         57.3         272.5
    64B          61.0         780.5

SLIDE 16

Other Results (See Paper)

  • Negligible client performance overhead
  • Failure recovery is quick and easy
  • Sharding does not hurt compression rate
  • More datasets

– Microsoft Exchange, Stack Exchange

SLIDE 17

Conclusion & Future Work

  • sDedup: Similarity-based deduplication for replicated document databases

– Much greater data reduction than traditional dedup
– Up to 38x compression ratio for Wikipedia
– Resource-efficient design with negligible overhead

  • Future work

– More diverse datasets
– Dedup for local database storage
– Different similarity search schemes (e.g., super-fingerprints)

SLIDE 18

Backup Slides

SLIDE 19

Compression: StackExchange

Compression ratio by chunk size (10GB sampled StackExchange dataset):

    Chunk Size  sDedup  trad-dedup
    4KB          1.0     1.0
    1KB          1.2     1.0
    256B         1.3     1.1
    64B          1.8     1.2

SLIDE 20

Memory: StackExchange

Memory footprint by chunk size (10GB sampled StackExchange dataset):

    Chunk Size  sDedup (MB)  trad-dedup (MB)
    4KB           83.9         302.0
    1KB          115.4         439.8
    256B         228.4         899.2
    64B          414.3       3,082.5

SLIDE 21

Throughput Overhead

SLIDE 22

Failure Recovery

[Chart: recovery behavior around the failure point; 20GB sampled Wikipedia dataset.]

SLIDE 23

Dedup + Sharding

Compression ratio vs. number of shards (20GB sampled Wikipedia dataset):

    Shards  Compression Ratio
    1       38.4
    3       38.2
    5       38.1
    9       37.9

SLIDE 24

Delta Compression

  • Byte-level diff between source and target docs:

– Based on the xDelta algorithm
– Improved speed with minimal loss of compression

  • Encoding:

– Descriptors about duplicate/unique regions + unique bytes

  • Decoding:

– Use source doc + encoded output
– Concatenate byte regions in order
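As a toy illustration of this descriptor format (a stand-in built on Python's difflib, not the paper's xDelta-derived encoder), the following sketch emits COPY/INSERT descriptors and decodes by concatenating the byte regions in order:

    import difflib

    def delta_encode(source: bytes, target: bytes) -> list:
        """Emit COPY (offset, length into source) and INSERT (literal bytes)
        descriptors sufficient to rebuild the target."""
        ops = []
        matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                ops.append(("COPY", i1, i2 - i1))      # duplicate region in source
            elif j2 > j1:
                ops.append(("INSERT", target[j1:j2]))  # unique bytes
        return ops

    def delta_decode(source: bytes, ops: list) -> bytes:
        """Concatenate byte regions in order, pulling COPY ranges from the source."""
        out = bytearray()
        for op in ops:
            if op[0] == "COPY":
                _, offset, length = op
                out += source[offset:offset + length]
            else:
                out += op[1]
        return bytes(out)

For similar documents, the ops list is dominated by a few large COPY descriptors plus a handful of short INSERTs, which is why the delta is so much smaller than the document itself.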
