SALZA: Algorithmic Information Theory and Universal Classification

SLIDE 1

General information measures SALZA similarity as information Applications of SALZA Parallel implementation

SALZA: Algorithmic Information Theory and Universal Classification for Sequences

SeqBio 2018, Rouen, France

François Cayre, Nicolas Le Bihan and Marion Revolle

GIPSA-Lab | DIS | CICS

November 19th, 2018

SLIDE 2

General information measures SALZA similarity as information Applications of SALZA Parallel implementation Definitions Examples

Axioms for measuring information

Definition (General information measure [Steudel et al., 2010])
Let X be a set of discrete-valued random variables, Ω = 2^X be the set of its subsets, (Ω, ∧, ∨) be a finite lattice and 0 be the meet of all elements. R : Ω → ℝ is an information measure if it satisfies:

1. Normalization: R(0) = 0;
2. Monotonicity: ∀ s,t ∈ Ω, s ≤ t ⟹ R(s) ≤ R(t);
3. Submodularity: ∀ s,t ∈ Ω, R(s) + R(t) ≥ R(s ∨ t) + R(s ∧ t).

Definition (Conditional mutual information [Steudel et al., 2010])
∀ s,t,u ∈ Ω, I(s : t|u) = R(s ∨ u) + R(t ∨ u) − R(s ∨ t ∨ u) − R(u).

s and t are said to be independent given u if I(s : t|u) = 0.
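
As a minimal sketch of these definitions, the snippet below takes the lattice of subsets of a small collection of texts (join = set union) and uses vocabulary size as the measure R. The function names `vocab_size` and `cond_mutual_info` and the toy texts are illustrative assumptions, not part of the talk.

```python
def vocab_size(texts):
    """R(s): number of distinct words found across the texts in s (R of the empty set is 0)."""
    words = set()
    for text in texts:
        words.update(text.split())
    return len(words)


def cond_mutual_info(R, s, t, u=frozenset()):
    """I(s : t | u) = R(s v u) + R(t v u) - R(s v t v u) - R(u), with join = set union."""
    return R(s | u) + R(t | u) - R(s | t | u) - R(u)


# Hypothetical toy data: each element of the lattice is a set of texts.
s = frozenset({"the cat sat"})
t = frozenset({"the dog sat"})
u = frozenset({"a cat"})

print(cond_mutual_info(vocab_size, s, t))     # words shared by s and t
print(cond_mutual_info(vocab_size, s, t, u))  # same, given the words of u
```

With this choice of R, I(s : t|u) counts the words shared by s and t that do not already appear in u, which is why it can never be negative.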

SLIDE 3

Deriving information theory

Lemma (Non-negativity of mutual information and conditioning [Steudel et al., 2010])
∀ s,t,u ∈ Ω, the following hold:

1. 0 ≤ I(s : t|u);
2. 0 ≤ I(s|t,u) ≤ I(s|t).

Lemma (Chain rule [Steudel et al., 2010])
∀ s,t,u,x ∈ Ω, I(s : t ∨ u|x) = I(s : t|x) + I(s : u|t,x).

Lemma (Data processing inequality [Steudel et al., 2010])
∀ s,t,x ∈ Ω, R(s|t) = 0 ⟹ I(s : x|t) = 0 ⟹ I(s : x) ≤ I(t : x).
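
The chain rule follows directly from the definition of conditional mutual information. The short derivation below is a worked check, assuming that conditioning on "t,x" means conditioning on the join t ∨ x.

```latex
\begin{align*}
I(s : t \mid x) &= R(s \vee x) + R(t \vee x) - R(s \vee t \vee x) - R(x),\\
I(s : u \mid t, x) &= R(s \vee t \vee x) + R(u \vee t \vee x) - R(s \vee u \vee t \vee x) - R(t \vee x),\\
I(s : t \mid x) + I(s : u \mid t, x) &= R(s \vee x) + R(t \vee u \vee x) - R(s \vee t \vee u \vee x) - R(x)\\
&= I(s : t \vee u \mid x).
\end{align*}
```

The terms R(t ∨ x) and R(s ∨ t ∨ x) cancel between the first two lines, which leaves exactly the definition applied to s and t ∨ u.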

SLIDE 4

Examples of information measures [Steudel et al., 2010]

Common examples
- Shannon entropy of random variables;
- Kolmogorov complexity of binary strings;
- Period length of time series;
- Size of vocabulary in a text.

Complexity/compression-based
- Lempel-Ziv complexity (LZ76);
- Grammar-based compression;
- LZ77? Ziv-Merhav? (Now, that's a cliffhanger!)

SLIDE 5

General information measures SALZA similarity as information Applications of SALZA Parallel implementation SALZA factorizations of sequences Taking similarity into account SALZA measures of similarity S✶ (LZ77) and S✶^+ (Ziv-Merhav) complexities as information measures Symmetry and positivity of S_f in practice

Sequences

Definition (Sequences)
A sequence x is defined as a finite succession of symbols drawn from a countable alphabet A. Let |x| be the length of the sequence x; the empty sequence is ∅.

A⁺ is the set of all non-empty sequences and A⋆ = ∅ ∪ A⁺.

In a set of n sequences x1,...,xn, the first k sequences are denoted by x≤k, and x≤0 = ∅.

SLIDE 6

SALZA factorizations

Definition (Prior knowledge R and factorizations)
Given sequences y,x1,...,xn ∈ A⋆, the notation y ≀ x1,...,xn stands for the generic case and denotes either of the following canonical factorizations:

1. y|x1,...,xn: R is the past of y and the entirety of x1,...,xn
   → LZ77-based factorization;
2. y|+x1,...,xn: R is the entirety of x1,...,xn
   → Ziv-Merhav-based factorization.

SLIDE 7

SALZA factorizations in picture

[Figure: the sequence y, split into its past (already factorized) and the part still to be factorized, drawn alongside the sequences x1, x2, x3, ..., xn. Two brackets mark the prior knowledge R: R for y|x1,...,xn (LZ77) and R for y|+x1,...,xn (Ziv-Merhav).]

SLIDE 10

SALZA symbols (and lengths)

Definition (SALZA symbols (s,l,z) and their lengths L_{y≀x1,...,xn})
By always finding the next longest subsequence in R, SALZA computes a factorization of y into m symbols (s_i,l_i,z_i), 1 ≤ i ≤ m:

y ≀ x1,...,xn = (s_1,l_1,z_1)...(s_m,l_m,z_m).

- Literals: s = y, l = 1 and z is the symbol in A that should be copied to the output buffer;
- References: l > 1 is the length of a subsequence in R.

SALZA symbol lengths are collected into:

L_{y≀x1,...,xn} = {l_i}, 1 ≤ i ≤ m.
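
The greedy rule above can be sketched with a naive quadratic search. This is only an illustration, not the SALZA implementation: the name `salza_lengths` is hypothetical, only the lengths l_i are returned (not the full (s,l,z) triples), each prior sequence is searched separately, and a match taken from the past of y is assumed not to overlap the part still to be factorized.

```python
def salza_lengths(y, refs=(), use_past=True):
    """Greedy factorization of y; returns the list of symbol lengths l_i.

    use_past=True  mimics y|x1,...,xn  (R = past of y plus all of refs);
    use_past=False mimics y|+x1,...,xn (R = all of refs only).
    """
    lengths, i = [], 0
    while i < len(y):
        sources = list(refs) + ([y[:i]] if use_past else [])
        best = 1  # a literal always has length 1
        # Extend the match while y[i:i+best+1] still occurs somewhere in R.
        while i + best < len(y) and any(y[i:i + best + 1] in s for s in sources):
            best += 1
        lengths.append(best)
        i += best
    return lengths


print(salza_lengths("abcabcabc"))                         # [1, 1, 1, 3, 3]
print(salza_lengths("abcab", ["xabcx"], use_past=False))  # [3, 2]
```

In the first call the three literals a, b, c are followed by two references of length 3 into the past of y; in the second call R is the prior sequence only.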

SLIDE 11

Product of SALZA factorizations

Definition (Product of SALZA factorizations)
Let y1 and y2 be two sequences being factorized, with respective prior knowledge sequences x_{1,1},...,x_{1,n1} and x_{2,1},...,x_{2,n2}. Let also:

y1 ≀ x_{1,1},...,x_{1,n1} = (s_{1,1},l_{1,1},z_{1,1})...(s_{1,m1},l_{1,m1},z_{1,m1}),

and

y2 ≀ x_{2,1},...,x_{2,n2} = (s_{2,1},l_{2,1},z_{2,1})...(s_{2,m2},l_{2,m2},z_{2,m2}).

We define their factorization product as the concatenation of their SALZA symbols:

(y1 ≀ x_{1,1},...,x_{1,n1}) × (y2 ≀ x_{2,1},...,x_{2,n2}) = (s_{1,1},l_{1,1},z_{1,1})...(s_{1,m1},l_{1,m1},z_{1,m1}) (s_{2,1},l_{2,1},z_{2,1})...(s_{2,m2},l_{2,m2},z_{2,m2}).

SLIDE 12

SALZA joint and LZ77 factorizations

Definition (SALZA joint and LZ77 factorizations)
By convention, set x≤0 = ∅. The joint factorization of x1,...,xn ∈ A⋆ is defined as the following product of factorizations:

$$x_1 \cdot \ldots \cdot x_n = \prod_{i=1}^{n} x_i \wr x_{\le i-1}.$$

Hence, x1|∅ denotes the usual LZ77 factorization of x1. Moreover, x1|+∅ denotes the succession of symbols forming x1.

On asymmetry
Note that in general, x · y ≠ y · x. On sequences, we are limited (?) to asymmetric relationships, see [Steudel et al., 2010].

SLIDE 13

Rationale

Noisy-stemming hypothesis [Cancedda et al., 2003]
"Multiple word matching really does occur and is beneficial in forming discriminant, high weight features."

Sequence compressibility [Raskhodnikova et al., 2013]
Compressibility of a sequence using LZ77 is an inverse function of its ℓ-th subword complexity, for small ℓ.

The higher the number of small subsequences to be compressed (noise), the lower the discriminative power using compressors.

Morphological normalization in SALZA
We shall penalize small subsequence lengths in the factorizations.

SLIDE 14

Admissible functions and noise level

Definition (Admissible function)
For a sequence x, f : N⋆ → [0,1] is an admissible function iff:

1. f is monotonically increasing, and
2. ∃ 0 < T < |x|, ∀ l ≥ T, f(l) = 1.

The value T acts as an internal threshold, above which all reference lengths are equally considered most meaningful.

Definition (Noise level l⋆)
An admissible function may be centered on a noise level l⋆: f(l⋆) = 1/2.

SLIDE 15

Admissible functions

[Figure: score (0.0 to 1.0) versus reference length (2 to 14) for the sigmoid (α = 3), threshold and exponential (ε = 10⁻⁴) admissible functions, with l0 marked on the length axis.]

Name         T                                     f(l < T)                                         f(1)               f(l⋆)
count (✶)    1                                     -                                                1                  1
threshold    l⋆                                    -                                                -                  1
linear       1 + 2(l⋆ − 1)                         (l − 1) / (2(l⋆ − 1))                            0                  0.5
quadratic    √(2(l⋆)² − 1)                         (l² − 1) / (2((l⋆)² − 1))                        0                  0.5
sigmoid      αl⋆                                   1 / (1 + e^(−l + αl⋆))                           1 / (1 + e^(αl⋆))  0.5
exponential  (log 2 + l⋆ log ε) / (log 2 + log ε)  exp(log ε − (l − 1)(log 2 + log ε) / (l⋆ − 1))   ε                  0.5
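
The linear, quadratic and exponential rows of the table can be written down directly. The sketch below is illustrative only: the function names are hypothetical and a noise level l⋆ > 1 is assumed so that the denominators are non-zero. It checks the three properties the table lists: the value at l = 1, the value 1/2 at l⋆, and saturation at 1 from T onwards.

```python
import math


def linear(l, l_star):
    T = 1 + 2 * (l_star - 1)
    return 1.0 if l >= T else (l - 1) / (2 * (l_star - 1))


def quadratic(l, l_star):
    T = math.sqrt(2 * l_star ** 2 - 1)
    return 1.0 if l >= T else (l * l - 1) / (2 * (l_star ** 2 - 1))


def exponential(l, l_star, eps=1e-4):
    a, b = math.log(2), math.log(eps)
    T = (a + l_star * b) / (a + b)
    return 1.0 if l >= T else math.exp(b - (l - 1) * (a + b) / (l_star - 1))


l_star = 4  # hypothetical noise level
for f in (linear, quadratic, exponential):
    print(f.__name__, [round(f(l, l_star), 4) for l in range(1, 9)])
```

Each printed list is non-decreasing, passes through 0.5 at l = 4 and stays at 1.0 once l reaches T, as the definition of an admissible function requires.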

SLIDE 16

SALZA relative similarity

Definition (SALZA relative similarity)
Given an admissible function f, and sequences y,x1,...,xn ∈ A⋆, the relative SALZA similarity of y given x1,...,xn, denoted S_f(y ≀ x1,...,xn), is defined as:

$$S_f(y \wr x_1,\ldots,x_n) = |y| - \sum_{l \in L_{y \wr x_1,\ldots,x_n}} (l-1)\, f(l).$$

Lemma (S_f(y ≀ x1,...,xn) is bounded)
0 ≤ S_f(y ≀ x1,...,xn) ≤ |y|.
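
To make the definition concrete, here is a small sketch that evaluates S_f from a list of symbol lengths. The name `salza_similarity`, the example lengths and the two admissible functions are assumptions chosen for illustration; the lengths are those of a 9-symbol sequence factorized into three literals and two references of length 3.

```python
def salza_similarity(y_len, lengths, f):
    """S_f = |y| minus the sum over symbol lengths l of (l - 1) * f(l)."""
    return y_len - sum((l - 1) * f(l) for l in lengths)


def count(l):
    """Admissible function with T = 1: every length scores 1."""
    return 1.0


def linear(l, l_star=4):
    """Linear admissible function centered on a hypothetical noise level l_star."""
    T = 1 + 2 * (l_star - 1)
    return 1.0 if l >= T else (l - 1) / (2 * (l_star - 1))


lengths = [1, 1, 1, 3, 3]  # hypothetical factorization of a sequence with |y| = 9
print(salza_similarity(9, lengths, count))   # 5.0, the number of symbols
print(salza_similarity(9, lengths, linear))  # short references are discounted
```

With the count function the result equals the number of symbols in the factorization, since the lengths sum to |y|. Any other admissible function gives a value between that count and |y|, consistent with the bound in the lemma.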

SLIDE 17

SALZA joint similarities and self-similarity

Lemma (SALZA joint similarities and LZ77-based self-similarity)
Given an admissible function f and sequences x1,...,xn ∈ A⋆, the SALZA joint similarities are computed as:

$$S_f(x_1 \cdot \ldots \cdot x_n) = \sum_{i=1}^{n} S_f(x_i \mid x_{\le i-1}); \qquad S_f^{+}(x_1 \cdot \ldots \cdot x_n) = \sum_{i=1}^{n} S_f(x_i \mid^{+} x_{\le i-1}).$$

The order of the x1,...,xn does matter. The notation for joint similarity gracefully degrades into that of the LZ77-based computation of the self-similarity S_f(x1), and S_f^+(x1) = |x1|.

SLIDE 18

SALZA conditional mutual similarities, asymmetric versions

Definition (SALZA conditional mutual similarities)
Given an admissible function f and sequences x,y,z ∈ A⋆, the SALZA conditional mutual similarities of x and y given z are defined as:

I_f(x : y|z) = S_f(z · x) + S_f(z · y) − S_f(z · x · y) − S_f(z);
I_f(x : y|+z) = S_f^+(z · x) + S_f^+(z · y) − S_f^+(z · x · y) − |z|.

If I_f(x : y ≀ z) = 0, x and y are said to be dissimilar given z.

Lemma (Fast computation of I_f(x : y ≀ z))
I_f(x : y ≀ z) = S_f(y ≀ z) − S_f(y ≀ z,x).
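
The fast-computation lemma needs only two factorizations of y. The sketch below illustrates it for the Ziv-Merhav case with the count function, using a naive greedy search; `zm_lengths`, `similarity` and `mutual_similarity` are hypothetical names and the strings are toy data.

```python
def zm_lengths(y, refs):
    """Greedy Ziv-Merhav-style factorization of y against refs (naive search)."""
    lengths, i = [], 0
    while i < len(y):
        best = 1
        while i + best < len(y) and any(y[i:i + best + 1] in r for r in refs):
            best += 1
        lengths.append(best)
        i += best
    return lengths


def similarity(y, refs, f=lambda l: 1.0):
    """S_f(y |+ refs) = |y| minus the sum of (l - 1) * f(l)."""
    return len(y) - sum((l - 1) * f(l) for l in zm_lengths(y, refs))


def mutual_similarity(x, y, z):
    """I_f(x : y |+ z) = S_f(y |+ z) - S_f(y |+ z, x)."""
    return similarity(y, [z]) - similarity(y, [z, x])


print(mutual_similarity("abc", "abcab", "xyz"))  # 3.0: x explains most of y
print(mutual_similarity("qqq", "abcab", "xyz"))  # 0.0: x and y dissimilar given z
```

Adding x to the prior knowledge can only keep or lengthen the greedy matches at each position, so the difference measures how much of y is explained by x beyond what z already provides.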

SLIDE 19

Similarity as information

Lemma (Chain rule for SALZA similarity, asymmetric version)
Given sequences x,y,z,t ∈ A⋆,
I_f(x : y · z ≀ t) = I_f(x : y ≀ t) + I_f(x : z ≀ t,y).

Lemma (Data processing inequality for SALZA, asymmetric version)
Given sequences x,y,z ∈ A⋆:
S_f(y ≀ z) ≤ 1 ⟹ I_f(x : y ≀ z) = 0 ⟹ I_f(x : y ≀ ∅) ≤ I_f(z : y ≀ ∅).

SLIDE 20

Relative SALZA complexity

Definition (Relative SALZA complexity)
Given sequences y,x1,...,xn ∈ A⋆, the relative SALZA complexity of y given x1,...,xn is defined as S✶(y ≀ x1,...,xn).

If I✶(x : y ≀ z) = 0, x and y are said to be independent given z.

Lemma (SALZA relative complexity is non-increasing by conditioning)
For any three sequences x,y,z ∈ A⋆, S✶(y ≀ x,z) ≤ S✶(y ≀ x).

SLIDE 21

S✶ and S✶^+ as information measures

Lattice of sequences [Steudel et al., 2010]
We use the lexicographical order for ≤. Let x,y,z ∈ A⋆, A = {z,x} and B = {z,y}. The lattice operators are approximated as: A∧B = zxy and A∨B = z.

Theorem (S✶ and S✶^+ are information measures in the sense of [Steudel et al., 2010])

1. Normalization: S_f(∅) = S_f^+(∅) = 0;
2. Monotonicity: x ≤ y ⟹ S_f(x) ≤ S_f(y) and S_f^+(x) ≤ S_f^+(y);
3. Approximate submodularity: S✶(z · x) + S✶(z · y) ≥ S✶(z · x · y) + S✶(z), and: S✶^+(z · x) + S✶^+(z · y) ≥ S✶^+(z · x · y) + S✶^+(z).

SLIDE 22

Datasets

Name       Reference  Seq. size  Structured  Real  |A|
Markov     markov4    16KiB      Yes         No    4
Markov     markov64   16KiB      Yes         No    64
Markov     markov256  16KiB      Yes         No    256
Uniform    random4    16KiB      No          No    4
Uniform    random64   16KiB      No          No    64
Uniform    random256  16KiB      No          No    256
DNA        dna        15KiB      Yes         Yes   4
Languages  rights     15KiB      Yes         Yes   ∼64

SLIDE 23

Numerical behaviour of S_f

[Figure: two matrices over all pairs of datasets (random4, dna, markov4, rights, random64, markov64, random256, markov256), with colour scales in %, one from 0.0 to 3.6 and one from 0.0 to 4.0.]

Maximum departure from symmetry: |I_f(x : y|∅) − I_f(y : x|∅)| / |y|.

No departure from positivity: (S_f(y) − S_f(y|x)) / |y|.

SLIDE 24

General information measures SALZA similarity as information Applications of SALZA Parallel implementation Universal causality inference Universal classification

Plugging I_f into the PC algorithm [Spirtes et al., 1993]

[Figure: graph over the nodes fragment_1, fragment_2, fragment_3, fragment_4, fragment_5, fragment_6, fragment_7 and fragment_8.]

FIGURE – Left: 8 successive drafts by J.-P. Toussaint. Right: output of:

salza --pcalg -i ~/datasets/toussaint --skel stable --dag --alpha 0.01 | neato -Tpdf > toussaint.pdf

SLIDE 26

Definition (NSD_f)
Given an admissible function f, and two sequences x,y ∈ A⁺, the normalized SALZA semi-distance, denoted NSD_f, is defined as:

$$\mathrm{NSD}_f(x,y) = \max\left\{ \frac{S_f(x \mid^{+} y) - 1}{|x|},\ \frac{S_f(y \mid^{+} x) - 1}{|y|} \right\}.$$

Theorem
NSD_f is a normalized semi-distance.

In practice
We did not witness any violation of the triangle inequality during our tests.
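
A rough sketch of the semi-distance follows, assuming the count function and a naive Ziv-Merhav-style greedy search. The helper names `zm_lengths`, `similarity` and `nsd` are hypothetical, and this is not the SALZA code.

```python
def zm_lengths(y, refs):
    """Greedy Ziv-Merhav-style factorization of y against refs (naive search)."""
    lengths, i = [], 0
    while i < len(y):
        best = 1
        while i + best < len(y) and any(y[i:i + best + 1] in r for r in refs):
            best += 1
        lengths.append(best)
        i += best
    return lengths


def similarity(y, refs, f):
    """S_f(y |+ refs) = |y| minus the sum of (l - 1) * f(l)."""
    return len(y) - sum((l - 1) * f(l) for l in zm_lengths(y, refs))


def nsd(x, y, f=lambda l: 1.0):
    """NSD_f(x, y) = max((S_f(x|+y) - 1) / |x|, (S_f(y|+x) - 1) / |y|), for non-empty x, y."""
    return max((similarity(x, [y], f) - 1) / len(x),
               (similarity(y, [x], f) - 1) / len(y))


print(nsd("abcabc", "abcabc"))  # 0.0: each sequence is one reference into the other
print(nsd("aaaa", "bbbb"))      # 0.75: nothing in common, only literals
```

The "− 1" in each numerator is what brings identical sequences to a distance of exactly 0, since a sequence that is fully contained in the other factorizes into a single symbol.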

SLIDE 27

Comparison with NCD/xz [Cilibrasi and Vitányi, 2005]

(a) NCD/xz. Dendrogram (branch length 0.0 to 0.6), leaves in order: irishGaelic, scottishGaelic, wallon, occitanAuvergnat, uzbek, estonian, finnish, hungarian, turkish, icelandic, faroese, welsh, breton, basque, latvian, lithuanian, sorbian, polish, slovak, czech, slovenian, croatian, serbian, bosnian, german, luxembourgish, frisan, Dutch, afrikaans, swedish, norwegianNynorsk, norwegianBokmal, danish, albanian, maltese, corsican, sardinian, romanian, english, rhaetoRomance, italian, friulian, french, occitan, catalan, protuguese, asturian, galician, spanish.

(b) NSD, l⋆ = 2.12. Dendrogram (branch length 0.0 to 0.7), leaves in order: welsh, scottishGaelic, irishGaelic, maltese, lithuanian, latvian, sorbian, polish, slovak, czech, slovenian, croatian, serbian, bosnian, albanian, wallon, occitanAuvergnat, romanian, english, corsican, italian, friulian, rhaetoRomance, sardinian, french, occitan, catalan, protuguese, asturian, spanish, galician, hungarian, uzbek, turkish, icelandic, faroese, finnish, estonian, basque, breton, luxembourgish, german, frisan, dutch, afrikaans, swedish, norwegianNynorsk, norwegianBokmal, danish.

FIGURE – rights dataset: phenetic representation of various human writing systems. Texts are translations of the Universal Declaration of Human Rights.

SLIDE 29

General information measures SALZA similarity as information Applications of SALZA Parallel implementation Index/search structure Factorization

Index structure, A ⊂ [0;255]

[Figure: index structure over the sequence x (example content: a b c a b c a b c). It shows a table of 64K entries with 2-byte indexing, tables of 256 entries with 1-byte indexing, entries that are either NULL or a symbol ('a', 'b', 'c'), and vectors of uint32_t. Annotation: amortized #calls to calloc.]

SLIDE 30

Step 1 : Block index building

[Figure: x is cut into blocks of dict_block symbols, with a last block of |x| % dict_block symbols; one copy of the index structure is shown per block.]

SLIDE 31

Step 2 : Block index merging

[Figure: the per-block indexes (blocks of dict_block symbols, last block of |x| % dict_block symbols) are merged; the 64K entries are divided into ranges of 64K/ncpus.]

SLIDE 32

Index structure summary

Theory

1. Also allows indexing small (length-2) subsequences:
   - contrary to the structure suggested in RFC 1951 (gzip/DEFLATE);
   - important for small l⋆;
2. Space complexity varies with sequence complexity.

Practice

1. Full range of subsequence lengths, up to 4 giga-symbols;
2. Fast, parallel, lock-free construction:
   - geometric growing scheme for index vectors;
   - scales (nearly) linearly when not limited by memory bandwidth!
3. Cache-friendly during factorization.
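
The two construction steps (per-block building, then merging) can be mimicked with plain dictionaries. This is a sequential toy model under stated assumptions, not the lock-free C structure of the talk: `build_block_index`, `merge_indexes` and `build_index` are hypothetical names, keys are 2-symbol prefixes, values are lists of start positions, and the per-block builds are the part that could run in parallel.

```python
from collections import defaultdict


def build_block_index(x, start, stop):
    """Index every 2-symbol subsequence of x whose first symbol lies in [start, stop)."""
    index = defaultdict(list)
    for i in range(start, min(stop, len(x) - 1)):
        index[x[i:i + 2]].append(i)
    return index


def merge_indexes(indexes):
    """Concatenate the position lists of per-block indexes, in block order."""
    merged = defaultdict(list)
    for index in indexes:
        for key, positions in index.items():
            merged[key].extend(positions)
    return merged


def build_index(x, dict_block):
    """Step 1: one index per block (independent tasks). Step 2: merge them."""
    blocks = [build_block_index(x, s, s + dict_block)
              for s in range(0, len(x), dict_block)]
    return merge_indexes(blocks)


index = build_index(b"abcabcabc", dict_block=4)
print(index[b"ab"])  # [0, 3, 6]
print(index[b"ca"])  # [2, 5]
```

Because blocks are merged in increasing position order, each position list stays sorted, and the result is the same as indexing x in a single pass.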

SLIDE 33

Step 1: Block factorization (overflow allowed)

[Figure: x is cut into blocks of fact_block symbols, with a last block of |x| % fact_block symbols; each block is factorized on its own, and overflow past the end of a block is allowed.]

Step 2: Stitching back to the true sequential factorization

[Figure: same block layout, with "Sync!" marks added one at a time along x.]

SLIDE 38

Factorization summary

Theory

1. Guaranteed to recover the true sequential factorization:
   - slower but true to the theory!
2. None of the usual LZ tricks found in compressors...
   - e.g. no lazy match, no minimum reference length, etc.

Practice

1. Much of the work is done in parallel;
2. Very few additional calls for stitching at block boundaries;
3. Scales (nearly) linearly when not limited by memory bandwidth!
   → ~4 threads will eat up most of the bandwidth.
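
One possible reading of the block-then-stitch scheme is sketched below for the self-similarity case (R = past of x). It is a guess at the idea, not the authors' algorithm: all names are hypothetical, the search is naive, and the block parses are computed in a plain loop although they are independent of each other. Each block is parsed greedily from its own start; the stitching pass then walks the true parse across a boundary until it lands on a symbol start the block parse already contains, after which the block's remaining symbols are reused unchanged.

```python
def longest_match(x, i):
    """Length of the greedy symbol at position i, with R = past of x, i.e. x[:i]."""
    best = 1
    while i + best < len(x) and x[i:i + best + 1] in x[:i]:
        best += 1
    return best


def parse_from(x, start, stop):
    """Greedy parse started at `start`; the last symbol may overflow `stop`."""
    starts, i = [], start
    while i < stop:
        starts.append(i)
        i += longest_match(x, i)
    return starts, i


def stitched_parse(x, fact_block):
    # Step 1: every block is parsed independently (parallelizable in principle).
    bounds = [(s, min(s + fact_block, len(x))) for s in range(0, len(x), fact_block)]
    blocks = [parse_from(x, s, e) for s, e in bounds]
    # Step 2: advance sequentially until reaching a symbol start the block already has.
    result, pos = [], 0
    for (_, stop), (starts, end) in zip(bounds, blocks):
        known = set(starts)
        while pos < stop and pos not in known:
            result.append(pos)
            pos += longest_match(x, pos)
        if pos < stop:  # synchronized: the rest of this block's parse is the true one
            result.extend(p for p in starts if p >= pos)
            pos = end
    return result


x = "abcabcabcxxabcabc"
print(stitched_parse(x, 4) == parse_from(x, 0, len(x))[0])  # True
```

The reuse is valid because a greedy symbol depends only on its start position and on x[:i], which is fully known in advance, so two parses that share a start position agree from there on.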

SLIDE 39

Sample run (self-similarity assessment)

cayre@code:~$ cat /proc/cpuinfo | grep name | head -n 1
model name : AMD Ryzen 7 1700 Eight-Core Processor
cayre@code:~$ ls -l datasets/books/montesquieu_esprit_des_lois.txt
-rw-r--r-- 1 cayre login 2167031 Nov 8 15:36 datasets/books/montesquieu_esprit_des_lois.txt
cayre@code:~$ time gzip datasets/books/montesquieu_esprit_des_lois.txt
real 0m0.175s
cayre@code:~$ gunzip datasets/books/montesquieu_esprit_des_lois.txt.gz
cayre@code:~$ time xz datasets/books/montesquieu_esprit_des_lois.txt
real 0m1.050s
cayre@code:~$ unxz datasets/books/montesquieu_esprit_des_lois.txt.xz
cayre@code:~$ time salza --cpus 1 -s -f "\"|\"" -i datasets/books/montesquieu_esprit_des_lois.txt
8.669940e+05
real 0m2.135s
cayre@code:~$ time salza --cpus 8 -s -f "\"|\"" -i datasets/books/montesquieu_esprit_des_lois.txt
8.669940e+05
real 0m0.572s
cayre@code:~$

Multithreaded speedup in SALZA on 8 cores: ×3.73
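
The quoted speedup follows from the two wall-clock times in the run above. The parallel efficiency on the second line is an extra figure derived here, not one stated in the talk.

```latex
\begin{align*}
\text{speedup} &= \frac{t_{1\,\text{cpu}}}{t_{8\,\text{cpus}}} = \frac{2.135\ \text{s}}{0.572\ \text{s}} \approx 3.73,\\
\text{efficiency} &= \frac{\text{speedup}}{8} \approx 0.47.
\end{align*}
```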

SLIDE 40

Open issues / what's next?

Open issues

1. No unit! (#ops?)
2. Used normalized I_f in the PC algorithm instead of a p-value!
   Our actual "independence" test: 0 ≤ I_f(x : y ≀ z1,...,zn) / |y| < α ≤ 1
3. Include search for symmetries?
4. No plan for |A| > 256!

What's next

1. Quantization as mapping onto sequences;
   → back to the good old days of 8-bit signals!
2. Time-similarity representation of signals.
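
The "independence" test listed under the open issues is a plain threshold on the normalized mutual similarity. A minimal sketch, assuming I_f and |y| have already been computed; the function name `is_independent` is hypothetical, and the default α = 0.01 simply echoes the value on the salza command line shown earlier in the deck.

```python
def is_independent(i_f, y_len, alpha=0.01):
    """Declare x and y independent given z1,...,zn when 0 <= I_f / |y| < alpha <= 1."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    ratio = i_f / y_len
    if ratio < 0:
        raise ValueError("normalized I_f is expected to be non-negative")
    return ratio < alpha


print(is_independent(0.0, 5))   # True: x brings nothing new about y
print(is_independent(3.0, 5))   # False: 3/5 is well above alpha
```

Unlike a p-value, this ratio has no probabilistic calibration, which is the concern the slide raises.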

SLIDE 41

Acknowledgements
Drafts from La Réticence by Jean-Philippe Toussaint are courtesy of Prof. Thomas Lebarbé, TGIR HumaNum.

Preprint, datasets and C code (GNU Affero GPL v3 + custom licenses)
https://forge.uvolante.org/stable/salza-driver/wikis/home

(Also contains a link to a Debian/Ubuntu amd64 repo for binaries.)

State of the code
Shared, undocumented C code, mostly sequential; multithreaded code to be released when documented.

SLIDE 42

References I

Cancedda, N., Gaussier, E., Goutte, C., and Renders, J.-M. (2003). Word-Sequence Kernels. Journal of Machine Learning Research, 3:1059–1082.

Cilibrasi, R. and Vitányi, P. M. (2005). Clustering by Compression. IEEE Transactions on Information Theory, 51:1523–1545.

Raskhodnikova, S., Ron, D., Rubinfeld, R., and Smith, A. (2013). Sublinear Algorithms for Approximating String Compressibility. Algorithmica, 65:685–709.

Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, Prediction, and Search. Springer.

SLIDE 43

References II

Steudel, B., Janzing, D., and Schölkopf, B. (2010). Causal Markov Condition for Submodular Information Measures. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 464–476, Madison, WI, USA. OmniPress.
