In-Place (Bijective) BWT Transforms (Dominik Köppl, Kyushu University)


In-Place (Bijective) BWT Transforms. Dominik Köppl (Kyushu University), Daiki Hashimoto (Tohoku University), Diptarama, Ayumi Shinohara. Data structures: Burrows-Wheeler Transform (BWT) [Burrows, Wheeler '94], Bijective BWT (BBWT) [Gil, Scott '12].


slide-1
SLIDE 1

In-Place (Bijective) BWT Transforms

Dominik Köppl Daiki Hashimoto Diptarama Ayumi Shinohara

Tohoku University Kyushu University

slide-2
SLIDE 2

2

data structures

Burrows-Wheeler Transform (BWT)

[Burrows,Wheeler '94]

Bijective BWT (BBWT)

[Gil,Scott '12]

slide-3
SLIDE 3

3

BWT of bacabbabb

T = bacabbabb$

slide-4
SLIDE 4

4

BWT of bacabbabb

T = bacabbabb$

all suffixes: bacabbabb$, acabbabb$, cabbabb$, abbabb$, bbabb$, babb$, abb$, bb$, b$, $

slide-5
SLIDE 5

5

BWT of bacabbabb

T = bacabbabb$

prev. char | suffix (all suffixes)
$ | bacabbabb$
b | acabbabb$
a | cabbabb$
c | abbabb$
a | bbabb$
b | babb$
b | abb$
a | bb$
b | b$
b | $

slide-6
SLIDE 6

6

BWT of bacabbabb

T = bacabbabb$

[Figure: all suffixes (bacabbabb$, acabbabb$, cabbabb$, abbabb$, bbabb$, babb$, abb$, bb$, b$, $) aligned left, each with its prev. char ($ b a c a b b a b b)]

slide-7
SLIDE 7

7

BWT of bacabbabb

T = bacabbabb$

BWT: sort all suffixes in lex. order (<lex), align them left, and read off the prev. char of each

prev. char | suffix (lex. order)
b | $
b | abb$
c | abbabb$
b | acabbabb$
b | b$
b | babb$
$ | bacabbabb$
a | bb$
a | bbabb$
a | cabbabb$

BWT(T) = bbcbbb$aaa
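The construction on this slide can be sketched in a few lines of Python. This is only a naive illustration (it materializes all suffixes, so it is far from in-place); the function name `bwt_by_sorting` is made up for this sketch, and it assumes that `$` does not occur in the text and sorts before every letter.

```python
def bwt_by_sorting(text: str) -> str:
    """Naive BWT: sort all suffixes of text + '$' and read off the preceding characters."""
    t = text + "$"  # assumption: '$' is not in text and is smaller than all letters
    # Suffix start positions in lexicographic order of the suffixes.
    order = sorted(range(len(t)), key=lambda i: t[i:])
    # t[i - 1] is the character before suffix i; for i = 0 it wraps around to '$'.
    return "".join(t[i - 1] for i in order)


print(bwt_by_sorting("bacabbabb"))  # bbcbbb$aaa
```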
slide-8
SLIDE 8

8

the BBWT is the BWT of the Lyndon factorization of an input text with respect to ≺ω

slide-9
SLIDE 9

9

the BBWT is the BWT of the Lyndon factorization of an input text with respect to ≺ω

1. Lyndon factorization
2. ≺ω order

slide-10
SLIDE 10

10

Lyndon words

– a
– aabab

a Lyndon word is smaller than

  • any of its proper suffixes
  • any of its other rotations
slide-11
SLIDE 11

11

Lyndon words

– a
– aabab

not Lyndon words:

– abaab (the rotation aabab is smaller)
– abab (abab is not smaller than its suffix ab)

a Lyndon word is smaller than

  • any of its proper suffixes
  • any of its other rotations
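The suffix characterization translates directly into a check. The following is a minimal quadratic-time sketch; the helper name `is_lyndon` is an assumption of this example, not something from the slides.

```python
def is_lyndon(w: str) -> bool:
    """A non-empty word is Lyndon iff it is strictly smaller than each proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


for w in ["a", "aabab", "abaab", "abab"]:
    print(w, is_lyndon(w))
```

On the slide's examples this reports a and aabab as Lyndon words and rejects abaab and abab.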
slide-12
SLIDE 12

12

Lyndon factorization [Chen+ '58]

  • input: text T
  • output: factorization T = T1 T2 ⋯ Tt with
    – each Tx is a Lyndon word
    – Tx ≥lex Tx+1
    – the factorization is uniquely defined (Chen-Fox-Lyndon theorem)
    – computable in linear time [Duval '88]

slide-13
SLIDE 13

13

example

T = bacabbabb

Lyndon factorization: b|ac|abb|abb

– b, ac, abb, and abb are Lyndon words
– b >lex ac >lex abb ≥lex abb
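For reference, here is a sketch of the textbook formulation of Duval's linear-time algorithm with random access to the text. The function name `lyndon_factorization` is chosen for this example only.

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: factorize s into a non-increasing sequence of Lyndon words."""
    factors = []
    n, i = len(s), 0
    while i < n:
        j, k = i + 1, i
        # Extend while s[i:j+1] is still a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i        # s[i:j+1] is itself a Lyndon word: restart the comparison
            else:
                k += 1       # still inside a repetition of the current Lyndon word
            j += 1
        # j - k is the length of the Lyndon word; emit each full repetition.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print("|".join(lyndon_factorization("bacabbabb")))  # b|ac|abb|abb
```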

slide-14
SLIDE 14

14

≺ω order

  • u ≺ω w :⟺ u u u u... <lex w w w w...
  • ab <lex aba
  • aba ≺ω ab
slide-15
SLIDE 15

15

≺ω order

  • u ≺ω w :⟺ u u u u... <lex w w w w...
  • ab <lex aba
  • aba ≺ω ab

because abaabaaba⋯ <lex abababab⋯
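The infinite strings never have to be built. A standard equivalent test is u u u ⋯ <lex w w w ⋯ exactly when u w <lex w u, which the sketch below uses; the name `omega_less` is an assumption of this example.

```python
def omega_less(u: str, w: str) -> bool:
    """u ≺ω w, tested via the equivalent finite comparison uw < wu."""
    return u + w < w + u


print("ab" < "aba")             # True: ab <lex aba
print(omega_less("aba", "ab"))  # True: aba ≺ω ab
```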

slide-16
SLIDE 16

16

BBWT of bacabbabb

b|ac|abb|abb

slide-17
SLIDE 17

17

BBWT of bacabbabb

b|ac|abb|abb

all rotations of the Lyndon factors: b; ac, ca; abb, bab, bba; abb, bab, bba

slide-18
SLIDE 18

18

BBWT of bacabbabb

b|ac|abb|abb

all rotations of the Lyndon factors: b; ac, ca; abb, bab, bba; abb, bab, bba

slide-19
SLIDE 19

19

BBWT of bacabbabb

b|ac|abb|abb

rotations sorted with respect to ≺ω: abb, abb, ac, bab, bab, bba, bba, b, ca

slide-20
SLIDE 20

20

BBWT of bacabbabb

b|ac|abb|abb

rotations sorted with respect to ≺ω, each with its last character:

rotation | last char
abb | b
abb | b
ac | c
bab | b
bab | b
bba | a
bba | a
b | b
ca | a

BBWT(T) = bbcbbaaba

slide-21
SLIDE 21

21

BBWT of bacabbabb

b|ac|abb|abb

rotations sorted with respect to ≺ω: abb, abb, ac, bab, bab, bba, bba, b, ca

last characters: b b c b b a a b a

BBWT(T) = bbcbbaaba

BWT(T$) = bbcbbb$aaa
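The definition used on these slides (factorize, take all rotations of all factors, sort by ≺ω, read the last characters) can be sketched naively as follows. It is neither in-place nor efficient; the function names are made up for the example, and the comparison u w <lex w u stands in for ≺ω.

```python
from functools import cmp_to_key


def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm (textbook formulation)."""
    factors, n, i = [], len(s), 0
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def omega_cmp(u: str, w: str) -> int:
    """Three-way comparison for ≺ω via uw versus wu."""
    a, b = u + w, w + u
    return (a > b) - (a < b)


def bbwt_by_sorting(text: str) -> str:
    """Naive BBWT: last characters of all rotations of all Lyndon factors in ≺ω order."""
    rotations = [f[i:] + f[:i]
                 for f in lyndon_factorization(text)
                 for i in range(len(f))]
    rotations.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rotations)


print(bbwt_by_sorting("bacabbabb"))  # bbcbbaaba
```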

slide-22
SLIDE 22

22

motivation

properties of BBWT:

  • no $ necessary
  • BBWT is more compressible than BWT for various inputs [Gil, Scott '12]
  • BBWT is indexable (full-text index)
  • BBWT is computable in O(n) time with O(n) words [Bannai+ '19]

however, O(n) words can be too much for large n

slide-23
SLIDE 23

23

in-place computation

  • Σ: alphabet, σ := |Σ| alphabet size
  • T : text, n := |T|
  • L : workspace of n lg σ bits, initially storing T (e.g. L = b a c a b b a b b)
  • aim : in-place computation, i.e. transform T ↔ BWT ↔ BBWT with |L| + O(lg n) bits of workspace

slide-24
SLIDE 24

24

known solutions

input | output | workspace | time | reference
text | BWT | in-place | O(n^2) | Crochemore+ '15
BWT | text | in-place | O(n^{2+ε}) | Crochemore+ '15
text | BBWT | O(n lg σ) bits | O(n lg n / lg lg n) | Bonomo+ '14

σ : alphabet size, n : text length, ε is a constant with 0 < ε < 1

slide-25
SLIDE 25

25

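in-place conversions

[Figure: conversion diagram between text, BWT, and BBWT; the edges are labeled O(n^2) or O(n^{2+ε}), and the conversions between text and BWT are marked as known]

working space: n lg σ + O(lg n) bits (including text)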

slide-26
SLIDE 26

26

forward search

T = bacabbabb$

L = b b c b b b $ a a a
F = $ a a a b b b b b c

slide-30
SLIDE 30

30

forward search

can calculate with rank and select on F and L

T = bacabbabb$

L = b b c b b b $ a a a
F = $ a a a b b b b b c

slide-31
SLIDE 31

31

forward search

FL mapping: FL(i) = L.select_{F[i]}( F.rank_{F[i]}(i) )

T = bacabbabb$

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  b  b  b  $  a  a  a
L.rank  1  2  1  3  4  5  1  1  2  3

(F.rank = F.rank_{F[i]}(i), L.rank = L.rank_{L[i]}(i))

slide-32
SLIDE 32

32

backward search

T = bacabbabb$

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  b  b  b  $  a  a  a
L.rank  1  2  1  3  4  5  1  1  2  3

(F.rank = F.rank_{F[i]}(i), L.rank = L.rank_{L[i]}(i))

FM index [Ferragina, Manzini '00]

slide-38
SLIDE 38

38

backward search

LF mapping: LF(i) := F.select_{L[i]}( L.rank_{L[i]}(i) )
                   = F.select_{L[i]}(1) + L.rank_{L[i]}(i) - 1
                   = |{ j : L[j] < L[i] }| + L.rank_{L[i]}(i)

T = bacabbabb$

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  b  b  b  $  a  a  a
L.rank  1  2  1  3  4  5  1  1  2  3

(F.rank = F.rank_{F[i]}(i), L.rank = L.rank_{L[i]}(i))

FM index [Ferragina, Manzini '00]

slide-39
SLIDE 39

39

LF: time complexity

If we store BWT(T) in L:

– L[i] = BWT[i]: O(1) time
  ⇒ for any c: L.rank_c(i) in O(n) time
– LF(i) = |{ j : L[j] < L[i] }| + L.rank_{L[i]}(i): each of the two terms takes O(n) time, so LF(i) takes O(n) time
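A sketch of this scan-based LF computation, using only the array L and a constant number of counters. Rows are 1-based as on the slides; the names `lf` and `text_from_bwt` are assumptions of this example.

```python
def lf(L: str, i: int) -> int:
    """LF(i) for a 1-based row i, computed with two O(n) scans over L."""
    c = L[i - 1]
    smaller = sum(1 for x in L if x < c)     # |{ j : L[j] < L[i] }|
    rank = sum(1 for x in L[:i] if x == c)   # L.rank_c(i)
    return smaller + rank


def text_from_bwt(L: str) -> str:
    """Backward walk: row 1 starts with '$', so L[1] is the last text character."""
    out, i = [], 1
    while L[i - 1] != "$":
        out.append(L[i - 1])
        i = lf(L, i)
    return "".join(reversed(out))


L = "bbcbbb$aaa"
print(lf(L, 1), lf(L, 7), lf(L, 3))  # 5 1 10
print(text_from_bwt(L))              # bacabbabb
```

Each `lf` call scans L in full, so the backward walk takes O(n^2) time overall, in line with the O(n)-per-LF bound stated on this slide.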

slide-40
SLIDE 40

40

FL: time complexity

  • FL(i) = L.select_{F[i]}( F.rank_{F[i]}(i) )
          = L.select_{F[i]}( i - |{ j : L[j] < F[i] }| )
  • If we know F[i]: FL(i) in O(n) time
  • however, the fastest in-place computation of F[i] takes O(n^{1+ε}) time [Munro, Raman '96] for any constant ε with 0 < ε < 1
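The formula can be checked with the following sketch. Note the loudly stated shortcut: F is obtained here with `sorted(L)`, which uses O(n) extra space and is therefore not the in-place selection the slide refers to. The names `fl` and `text_from_bwt_forward` are made up for this example.

```python
def fl(L: str, i: int) -> int:
    """FL(i) for a 1-based row i via FL(i) = L.select_{F[i]}(i - C[F[i]])."""
    F = sorted(L)                          # assumption: NOT in-place, for illustration only
    c = F[i - 1]
    smaller = sum(1 for x in L if x < c)   # C[c] = |{ j : L[j] < c }|
    k = i - smaller                        # F.rank_c(i)
    seen = 0
    for j, x in enumerate(L, start=1):     # L.select_c(k) by a linear scan
        if x == c:
            seen += 1
            if seen == k:
                return j
    raise ValueError("row index out of range")


def text_from_bwt_forward(L: str) -> str:
    """Forward walk: FL maps the row of '$' in F (row 1) to the row of T[1]."""
    F = sorted(L)
    out, i = [], fl(L, 1)
    while F[i - 1] != "$":
        out.append(F[i - 1])
        i = fl(L, i)
    return "".join(out)


L = "bbcbbb$aaa"
print(fl(L, 1), fl(L, 7))        # 7 4
print(text_from_bwt_forward(L))  # bacabbabb
```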

slide-41
SLIDE 41

41

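road map

[Figure: conversion diagram between text, BWT, and BBWT with edge labels O(n^{2+ε}), O(n^{2+ε}), and O(n^2); two of the conversions are marked 1. and 2.]

working space: n lg σ + O(lg n) bits (including text)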

slide-42
SLIDE 42

42

text → BBWT

slide-43
SLIDE 43

43

text → BBWT

for each Lyndon factor Tx with x = 1 up to t :
    prepend Tx[|Tx|] to BBWT
    p ← 1 (insert position in BBWT)
    for each i = |Tx|-1 down to 1 :
        p ← LF(p) + 1
        insert Tx[i] at BBWT[p]

[Bonomo+ '14]
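The pseudocode can be tried out with the sketch below. It uses a Python list with `insert`, so it illustrates the order of the insertions rather than the in-place memory layout; all function names are assumptions of this example, and LF is computed by plain scans over the current BBWT.

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm (textbook formulation)."""
    factors, n, i = [], len(s), 0
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def lf(L: list[str], p: int) -> int:
    """LF(p) = C[L[p]] + L.rank_{L[p]}(p) for a 1-based position p."""
    c = L[p - 1]
    return sum(1 for x in L if x < c) + sum(1 for x in L[:p] if x == c)


def bbwt_by_insertion(text: str) -> str:
    bbwt: list[str] = []
    for factor in lyndon_factorization(text):
        bbwt.insert(0, factor[-1])        # prepend Tx[|Tx|]
        p = 1                             # insert position in BBWT (1-based)
        for ch in reversed(factor[:-1]):  # i = |Tx|-1 down to 1
            p = lf(bbwt, p) + 1
            bbwt.insert(p - 1, ch)        # insert Tx[i] at BBWT[p]
    return "".join(bbwt)


print(bbwt_by_insertion("bacabbabb"))  # bbcbbaaba
```

With an O(n)-time LF per inserted character, the loop performs n insertions and so runs in O(n^2) time.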

slide-46
SLIDE 46

46

text → BBWT

T = bacabbabb

  • Lyndon factorization: b|ac|abb|abb
  • first: insert b

after inserting b:

F.rank  1
F       b
L       b
L.rank  1

final table, with L = BBWT(T) = bbcbbaaba:

F.rank  1  2  3  1  2  3  4  5  1
F       a  a  a  b  b  b  b  b  c
L       b  b  c  b  b  a  a  b  a
L.rank  1  2  1  3  4  1  2  5  3

how to calculate?

slide-49
SLIDE 49

49

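BBWT(T1 T2)

T = b|ac|abb|abb = T1 T2 T3 T4

  • next Lyndon factor: ac

BBWT(T1) = b:

F.rank  1
F       b
L       b
L.rank  1

after inserting c:

F.rank  1  1
F       b  c
L       c  b
L.rank  1  1

after inserting a, BBWT(T1 T2) = cba:

F.rank  1  1  1
F       a  b  c
L       c  b  a
L.rank  1  1  1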

slide-53
SLIDE 53

53

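BBWT(T1 T2 T3)

T = b|ac|abb|abb

  • next Lyndon factor: abb

BBWT(T1 T2) = cba:

F.rank  1  1  1
F       a  b  c
L       c  b  a
L.rank  1  1  1

after inserting b:

F.rank  1  1  2  1
F       a  b  b  c
L       b  c  b  a
L.rank  1  1  2  1

after inserting b:

F.rank  1  1  2  3  1
F       a  b  b  b  c
L       b  c  b  b  a
L.rank  1  1  2  3  1

after inserting a, BBWT(T1 T2 T3) = bcbaba:

F.rank  1  2  1  2  3  1
F       a  a  b  b  b  c
L       b  c  b  a  b  a
L.rank  1  1  2  1  3  2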

slide-60
SLIDE 60

60

text → BBWT in-place

  • |bacabbabb
  • b|acabbabb
  • bac|abbabb
  • cba|abbabb
  • cbaabb|abb
  • bcbaba|abb
  • bbcbbaaba|
slide-66
SLIDE 66

66

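detailed transformation (inserting the Lyndon factor abb into BBWT(T1 T2) = cba)

c b a | a b b
b c b a | a b      (prepend b, p ← 1)
b c b b a | a      (insert b at p = LF(1) + 1 = 3)
b c b a b a |      (insert a at p = LF(3) + 1 = 4)

LF(1) = C[b] + L.rank_b(1) = 2
LF(3) = C[b] + L.rank_b(3) = 3

where C[b] := |{ j : L[j] < b }|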

slide-67
SLIDE 67

67

BWT → BBWT

slide-68
SLIDE 68

68

BWT → BBWT in-place

  • Duval's algorithm
    – computes the Lyndon factorization
    – runs in O(n tL) time, where tL is the time for accessing an entry of T
  • the algorithm uses linear scans from any T[i] to T[i+1]
    ⇒ emulate this with the FL mapping
    ⇒ O(n^{2+ε}) time with only L storing the BWT
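To illustrate the emulation, the sketch below runs Duval's algorithm on rows of the BWT matrix instead of text positions: each of the three pointers of Duval's algorithm is a row, reading a character means reading F at that row, and moving from T[i] to T[i+1] means applying FL. As a simplification that is explicitly not in-place, F and FL are precomputed by a stable sort; the function names are assumptions of this example.

```python
def fl_tables(L: str) -> tuple[list[str], list[int]]:
    """F and the FL mapping (0-based), precomputed by a stable sort. NOT in-place."""
    order = sorted(range(len(L)), key=lambda j: (L[j], j))
    return [L[j] for j in order], order  # FL[i] = order[i]


def lyndon_factors_from_bwt(L: str) -> list[str]:
    """Duval's algorithm where every step T[i] -> T[i+1] is one FL application."""
    F, FL = fl_tables(L)
    ri, i = FL[0], 0                 # row and text offset of the current factor start
    factors = []
    while F[ri] != "$":
        rj, j = FL[ri], i + 1        # scanning pointer
        rk, k = ri, i                # comparison pointer
        while F[rj] != "$" and F[rk] <= F[rj]:
            if F[rk] < F[rj]:
                rk, k = ri, i
            else:
                rk, k = FL[rk], k + 1
            rj, j = FL[rj], j + 1
        period = j - k               # length of the detected Lyndon word
        while i <= k:
            word = []
            for _ in range(period):  # spell the factor by walking FL
                word.append(F[ri])
                ri = FL[ri]
            factors.append("".join(word))
            i += period
    return factors


print("|".join(lyndon_factors_from_bwt("bbcbbb$aaa")))  # b|ac|abb|abb
```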

slide-71
SLIDE 71

71

BWT → BBWT in situ

T = b|ac|abb|abb

  • with FL mapping + Duval we detect the first Lyndon factor b|a ...

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  b  b  b  $  a  a  a
L.rank  1  2  1  3  4  5  1  1  2  3

slide-73
SLIDE 73

73

construction of a cycle

T = b|ac|abb|abb

  • aim: create cycle b → b
  • since FL maps $ to T[1] we want to exchange $ and b

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  b  b  b  $  a  a  a
L.rank  1  2  1  3  4  5  1  1  2  3

slide-74
SLIDE 74

74

construction of a cycle

T = b|ac|abb|abb

  • aim: create cycle b → b
  • since FL maps $ to T[1] we want to exchange $ and b
  • however: might not work
  • need to fix red arrows

after exchanging $ and b in L (with recomputed L.rank):

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  $  b  b  b  a  a  a
L.rank  1  2  1  1  3  4  5  1  2  3

slide-75
SLIDE 75

75

construction of a cycle

  • since there are two red arrows: switch the next two entries below the exchanged b

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  $  b  b  b  a  a  a
L.rank  1  2  1  1  3  4  5  1  2  3

slide-76
SLIDE 76

76

construction of a cycle

  • the cycle moved below the exchange ⇒ modified LF mapping just “moved”

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  $  b  b  a  a  b  a
L.rank  1  2  1  1  3  4  1  2  5  3

slide-79
SLIDE 79

79

abstract idea

  • T1 ≥lex T2 ≥lex ⋯ ≥lex Tt ⇒ T[1..] >lex T[|T1|..]
  • need to change red arrows

[Figure: columns F and L with the entries x, e, e, b, $, e, e, e, next to the text T with T1, T2, $ and the characters b, e, x marked]

slide-80
SLIDE 80

80

abstract idea

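[Figure: columns F and L after the exchange, with the entries x, $, e, b, e, e, e, e, next to the text T with T1, T2, $ and the characters b, e, x marked]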

slide-81
SLIDE 81

81

abstract idea

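[Figure: columns F and L after the exchange, with the entries x, $, e, b, e, e, e, e, next to the text T with T1, T2, $ and the characters b, e, x marked]

the number of e's between the exchanged $ and e = the number of entries to switch after the e in F that mapped to the exchanged e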

slide-82
SLIDE 82

82

carrying on with example

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  $  b  b  a  a  b  a
L.rank  1  2  1  1  3  4  1  2  5  3

slide-84
SLIDE 84

84

carrying on with example

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  $  b  b  a  a  b  a
L.rank  1  2  1  1  3  4  1  2  5  3

(the slide shows this table twice)

slide-85
SLIDE 85

85

carrying on with example

i       1  2  3  4  5  6  7  8  9  10
F.rank  1  1  2  3  1  2  3  4  5  1
F       $  a  a  a  b  b  b  b  b  c
L       b  b  c  $  b  b  a  a  b  a
L.rank  1  2  1  1  3  4  1  2  5  3

(the slide shows this table twice, plus a variant in which the entries $ and c of L in rows 3 and 4 are exchanged: L = b b $ c b b a a b a)

slide-86
SLIDE 86

86

open problems

  • can we get rid of the FL mapping (O(n^{1+ε}) time)? (use only LF mapping)
  • trade-off algorithm for time ↔ space
  • Is the number of distinct Lyndon words of T bounded by the runs in the BBWT of T ?
    if so: O(r) words run-length compressed BBWT-index (r : runs in BBWT)

slide-88
SLIDE 88

88

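in-place conversions

[Figure: conversion diagram between text, BWT, and BBWT; the edges are labeled O(n^2) or O(n^{2+ε}), the conversions between text and BWT are marked as known, and one O(n^{2+ε}) conversion is marked "not shown here"]

working space: n lg σ + O(lg n) bits (including text)

any questions are welcome!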