
slide-1
SLIDE 1

Data Deduplication with Random Substitutions

Hao Lou Farzad Farnoud

Electrical and Computer Engineering University of Virginia {haolou,farzad}@virginia.edu

June 8, 2020

Lou, Farnoud ISIT2020 1 / 21

slide-2
SLIDE 2

Data explosion

David Reinsel, John Gantz, and John Rydning. “The digitization of the world: from edge to core”. In: IDC White Paper (2018)



slide-4
SLIDE 4

Data deduplication

Efficient data reduction approach: data deduplication.

Deduplication system

Chunk size: fixed-length or variable-length (content-defined).


slide-7
SLIDE 7

Data deduplication

Compared with traditional compression methods (LZ compression), deduplication:

eliminates chunk-level (8 KB)¹ or file-level redundancy;
uses hash-based fingerprints, with no byte-by-byte comparison.

⇒ Deduplication methods are more efficient for large-scale storage systems.

¹ Athicha Muthitacharoen, Benjie Chen, and David Mazieres. “A low-bandwidth network file system”. In: Proceedings of the eighteenth ACM symposium on Operating systems principles. 2001, pp. 174–187.
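The fingerprint idea on this slide can be illustrated with a minimal sketch. It assumes fixed-length 8 KB chunks and SHA-256 as the fingerprint. The function names `dedup_store` and `restore` are hypothetical and are not taken from the talk or the cited paper.

```python
import hashlib


def dedup_store(data: bytes, chunk_size: int = 8192):
    """Split data into fixed-size chunks and keep each distinct chunk once."""
    store = {}    # fingerprint -> chunk bytes (each distinct chunk stored once)
    recipe = []   # ordered fingerprints needed to rebuild the data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Chunks are compared through their fingerprints, not byte by byte.
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

With three identical 8 KB chunks followed by one different chunk, the store holds two chunks while the recipe lists four fingerprints.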

slide-8
SLIDE 8

An Information-theoretic point of view

Information source model: data stream.
Introduction of deduplication algorithms.
Performance analysis of deduplication algorithms.

slide-9
SLIDE 9

Existing work

Urs Niesen. “An information-theoretic analysis of deduplication”. In: IEEE Transactions on Information Theory 65.9 (2019), pp. 5688–5704.

Rasmus Vestergaard, Qi Zhang, and Daniel E Lucani. “Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties”. In: arXiv preprint arXiv:1901.02720 (2019).

Laura Conde-Canencia, Tyson Condie, and Lara Dolecek. “Data deduplication with edit errors”. In: 2018 IEEE Global Communications Conference (GLOBECOM). IEEE. 2018, pp. 1–6.


slide-14
SLIDE 14

Source model

$L_a \sim P_s$, a distribution over the integers; $X_a \sim \{0,1\}^{L_a}$.

Source symbols: $X_1, X_2, \ldots, X_A$, indexed by $a = 1, 2, \ldots, A$.

$X_{n_1}, X_{n_2}, \ldots, X_{n_B}$ are drawn with replacement from $\{X_1, X_2, \ldots, X_A\}$.

Concatenation and substitution: $X_{n_1} X_{n_2} \cdots X_{n_B} \xrightarrow{\text{substitutions}} Y_1 Y_2 \cdots Y_B$.

Data stream: $s = Y_1 Y_2 \cdots Y_B$, where $Y_1, Y_2, \ldots, Y_B$ are the source blocks.


slide-16
SLIDE 16

Source model

Substitutions: each bit is flipped independently with probability $\delta \le 1/2$, where $\delta$ is a constant.

Example: $A = 3$, $B = 4$.

Source symbols: $X_1 = 000$, $X_2 = 0010$, $X_3 = 11$.

Drawn symbols: $X_{n_1} = X_2 = 0010$, $X_{n_2} = X_1 = 000$, $X_{n_3} = X_3 = 11$, $X_{n_4} = X_1 = 000$.

After substitutions: $Y_1 = 1010$, $Y_2 = 000$, $Y_3 = 10$, $Y_4 = 001$.

⇒ $s = 101000010001$
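The source model can be simulated with a short sketch. It makes two simplifying assumptions that are not stated on the slide: every source symbol has the same fixed length L, and the bits of each source symbol are uniformly random. The function name `sample_stream` is hypothetical.

```python
import random


def sample_stream(A, B, L, delta, rng):
    """Sample source symbols, noisy source blocks, and the data stream s."""
    # Assumption: each of the A source symbols is a uniform random L-bit string.
    symbols = ["".join(rng.choice("01") for _ in range(L)) for _ in range(A)]
    blocks = []
    for _ in range(B):
        x = rng.choice(symbols)  # draw a source symbol with replacement
        # Flip each bit independently with probability delta.
        y = "".join(
            ("1" if bit == "0" else "0") if rng.random() < delta else bit
            for bit in x
        )
        blocks.append(y)
    # The data stream is the concatenation of the source blocks.
    return symbols, blocks, "".join(blocks)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible. With `delta = 0` every block is an exact copy of some source symbol.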


slide-18
SLIDE 18

Source model

Assumptions:

Length distribution $P_s$: mean $L$, with $P_s(L/2 \le L_a \le 2L) = 1$.
Asymptotically, $A = o(B^{1-\epsilon})$ with $0 < \epsilon < 1$, and $L = B^{1/k}$ with $k > 1$.

slide-19
SLIDE 19

Entropy

Entropy $H(s)$:

$$H(\delta) B L \le H(s) \le H(\delta) B L + o(BL) \quad \text{as } B \to \infty,$$

where $BL$ is the expected length of $s$.
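To get a feel for the leading term of this bound, the sketch below evaluates H(δ)BL numerically. It assumes that H(δ) denotes the binary entropy function measured in bits, which the slide does not spell out. The function names are hypothetical.

```python
import math


def binary_entropy(delta: float) -> float:
    """Binary entropy in bits (assumed meaning of H(delta) on the slide)."""
    if delta in (0.0, 1.0):
        return 0.0
    return -delta * math.log2(delta) - (1 - delta) * math.log2(1 - delta)


def entropy_leading_term(delta: float, B: int, L: int) -> float:
    """Leading term H(delta) * B * L of the entropy bound on H(s)."""
    return binary_entropy(delta) * B * L
```

At δ = 1/2 the leading term equals BL, the expected length of s, so no compression is possible in that regime.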


slide-21
SLIDE 21

Deduplication scheme

Double fixed-length deduplication:

[Figure: $s$ is split into segments $S_1, S_2, S_3, \ldots, S_K$, each of length $D$; each segment $S_k$ is split into chunks $Z^k_1, Z^k_2, \ldots, Z^k_C$.]

$s$ is parsed into segments of length $D$. Each segment $S_k$ is further parsed into chunks of length $\ell$.

slide-22
SLIDE 22

Deduplication scheme

Example: $D = 5$, $\ell = 3$.

s = 00001011010
⇒ s = 00001|01101|0 (segments of length D)
⇒ s = 000|01|011|01|0 (chunks of length at most ℓ)

Chunks: 000, 01, 011, 01, 0.
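The two-level parsing in this example can be sketched as follows. The helper names `split_fixed` and `double_fixed_parse` are hypothetical. The last piece at each level is allowed to be shorter, as in the example.

```python
def split_fixed(text: str, size: int):
    """Cut text into consecutive pieces of the given size (last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def double_fixed_parse(s: str, D: int, ell: int):
    """Parse s into segments of length D, then each segment into chunks of length ell."""
    segments = split_fixed(s, D)
    return [chunk for segment in segments for chunk in split_fixed(segment, ell)]
```

Calling `double_fixed_parse("00001011010", 5, 3)` reproduces the chunk list of the example above.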


slide-26
SLIDE 26

Deduplication scheme

Double fixed-length deduplication:

Prefix-free code for $|s|$.
Chunks are processed sequentially:
  First appearance: 1 + the chunk itself; the chunk is added to the dictionary.
  Repeated appearance: 0 + pointer to the dictionary entry.


slide-31
SLIDE 31

Deduplication scheme

Example: s = 00001011010 ⇒ 000, 01, 011, 01, 0.

Compressed string: 0001011 + 1000 + 101 + 1011 + 001 + 10

0001011: Elias γ code for |s| = 11.
1000, 101, 1011, 10: first occurrences of the chunks 000, 01, 011, 0.
001: second occurrence of 01.
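The encoding in this example can be reproduced with the sketch below. The pointer format is an assumption: a zero-based dictionary index written with ⌈log2(dictionary size)⌉ bits, which matches the pointer 01 in the example but is not specified on the slide. The function names are hypothetical.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code: (bit length - 1) zeros followed by n in binary."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def dedup_encode(s: str, D: int, ell: int):
    """Return the code words of double fixed-length deduplication for s."""
    code_words = [elias_gamma(len(s))]  # prefix-free code for |s|
    dictionary = {}                     # chunk -> zero-based index
    for seg_start in range(0, len(s), D):
        segment = s[seg_start:seg_start + D]
        for i in range(0, len(segment), ell):
            chunk = segment[i:i + ell]
            if chunk not in dictionary:
                # First appearance: flag 1 followed by the chunk itself.
                dictionary[chunk] = len(dictionary)
                code_words.append("1" + chunk)
            else:
                # Repeat: flag 0 followed by a pointer (assumed format).
                width = (len(dictionary) - 1).bit_length()
                pointer = format(dictionary[chunk], "b").zfill(width) if width else ""
                code_words.append("0" + pointer)
    return code_words
```

Joining the returned code words gives the compressed string. For the example input the code words are exactly the six pieces listed above.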


slide-34
SLIDE 34

Performance analysis

Setting:

Source model: $P_s(L_a = L) = 1$.
First-level parsing length: pick $D = L$.

[Figure: with $D = L$, the segments of $s$ coincide with the source blocks $Y_1, Y_2, \ldots, Y_B$, each of length $L$; each block $Y_b$ is parsed into chunks $Z^b_1, Z^b_2, \ldots, Z^b_C$.]

Length of the compressed version of $s$: $L_F(s)$.

slide-35
SLIDE 35

Performance analysis

Theorem

As $B \to \infty$, with the optimal chunk length $\ell$,

$$\frac{E[L_F(s)]}{H(s)} \le \frac{2}{\epsilon}\left(1 + \frac{1}{k}\right),$$

where $A = o(B^{1-\epsilon})$ and $L = B^{1/k}$.


slide-37
SLIDE 37

Deduplication scheme

Variable-length deduplication²

Example: s = 0100100011000, M = 2: s = 0100|100|01100|0; chunks: 0100, 100, 01100, 0.

² Urs Niesen. “An information-theoretic analysis of deduplication”. In: IEEE Transactions on Information Theory 65.9 (2019), pp. 5688–5704.
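The chunking rule is not stated on the slide. The sketch below assumes that a chunk ends immediately after each run of M consecutive zeros, which is consistent with the example but should be checked against the cited paper. The function name `marker_chunks` is hypothetical.

```python
def marker_chunks(s: str, M: int):
    """Split s after every run of M consecutive zeros (assumed marker rule)."""
    chunks = []
    start = 0      # index where the current chunk begins
    zero_run = 0   # consecutive zeros seen inside the current chunk
    for i, bit in enumerate(s):
        zero_run = zero_run + 1 if bit == "0" else 0
        if zero_run == M:
            chunks.append(s[start:i + 1])
            start, zero_run = i + 1, 0
    if start < len(s):
        chunks.append(s[start:])  # trailing bits that contain no full marker
    return chunks
```

Because boundaries depend on content rather than position, a bit flip only disturbs the chunks near it, whereas fixed-length parsing relies on the blocks staying aligned.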

slide-38
SLIDE 38

Deduplication scheme

Variable-length deduplication:

Prefix-free code for $|s|$.
Chunks are processed sequentially:
  First appearance: 1 + the chunk itself; the chunk is added to the dictionary.
  Repeated appearance: 0 + pointer to the dictionary entry.

$L_{VL}(s)$: length of the compressed string for $s$.


slide-40
SLIDE 40

Performance analysis

Theorem

The performance of variable-length deduplication with the optimal marker length $M$ satisfies

$$\frac{E[L_{VL}(s)]}{BL} \le c(\delta) \quad \text{as } B \to \infty,$$

where $c(\delta)$ is decreasing in $\delta$.

$c(\delta)/H(\delta)$ increases without bound as $\delta \to 0$.

Entropy: $H(s) = H(\delta) B L + o(BL)$.

slide-41
SLIDE 41

Performance analysis

[Figure: upper bound on $E[L_{VL}(s)]/(BL)$ and $H(\delta)$ versus the error rate $\delta$, with the optimal marker length, $A = L = B^{1/2}$, and $\delta$ ranging from $10^{-5}$ to $10^{-1}$; both axes are logarithmic.]

slide-42
SLIDE 42

Thank you!
