SALZA: Algorithmic Information Theory and Universal Classification

SALZA: Algorithmic Information Theory and Universal Classification for Sequences
SeqBio 2018, Rouen, France
François Cayre, Nicolas Le Bihan and Marion Revolle
GIPSA-Lab | DIS | CICS
November 19th, 2018

Outline: General information measures; SALZA similarity as information; Applications of SALZA; Parallel implementation

Section: General information measures (Definitions, Examples)

Axioms for measuring information

Definition (General information measure [Steudel et al., 2010])
Let $X$ be a set of discrete-valued r.v., $\Omega = 2^X$ be the set of subsets, $(\Omega, \wedge, \vee)$ be a finite lattice and $0$ be the meet of all elements. $R : \Omega \to \mathbb{R}$ is an information measure if it satisfies:
1. Normalization: $R(0) = 0$;
2. Monotonicity: $\forall s, t \in \Omega$, $s \le t \implies R(s) \le R(t)$;
3. Submodularity: $\forall s, t \in \Omega$, $R(s) + R(t) \ge R(s \vee t) + R(s \wedge t)$.

Definition (Conditional mutual information [Steudel et al., 2010])
$\forall s, t, u \in \Omega$, $I(s : t \mid u) = R(s \vee u) + R(t \vee u) - R(s \vee t \vee u) - R(u)$.
$s$ and $t$ are said to be independent given $u$ if $I(s : t \mid u) = 0$.
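The three axioms and the conditional mutual information can be checked by brute force on a small power-set lattice. The sketch below is only an illustration under stated assumptions: the toy texts in `TEXTS`, the choice of vocabulary size as the measure `R`, and all function names are hypothetical and do not come from the slides. Join and meet are taken to be set union and intersection.

```python
from itertools import combinations

# Hypothetical toy data: three short texts standing in for the elements of X.
TEXTS = {
    "a": "the cat sat",
    "b": "the dog sat",
    "c": "a bird flew",
}

def powerset(items):
    """All subsets of items, as frozensets (the lattice Omega = 2^X)."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def R(s):
    """Assumed measure: vocabulary size of the texts indexed by s."""
    vocab = set()
    for name in s:
        vocab.update(TEXTS[name].split())
    return len(vocab)

def is_information_measure(R, omega):
    """Brute-force check of normalization, monotonicity and submodularity."""
    normalization = R(frozenset()) == 0
    monotonicity = all(R(s) <= R(t) for s in omega for t in omega if s <= t)
    submodularity = all(R(s) + R(t) >= R(s | t) + R(s & t)
                        for s in omega for t in omega)
    return normalization and monotonicity and submodularity

def cond_mutual_information(R, s, t, u=frozenset()):
    """I(s : t | u) = R(s v u) + R(t v u) - R(s v t v u) - R(u)."""
    return R(s | u) + R(t | u) - R(s | t | u) - R(u)

omega = powerset(TEXTS)
a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
print(is_information_measure(R, omega))
print(cond_mutual_information(R, a, b))  # texts a and b share words
print(cond_mutual_information(R, a, c))  # texts a and c share none
```

With this toy measure, two texts are independent exactly when their vocabularies do not overlap.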

Deriving information theory

Lemma (Non-negativity of mutual information and conditioning [Steudel et al., 2010])
$\forall s, t, u \in \Omega$, the following hold:
1. $0 \le I(s : t \mid u)$;
2. $0 \le I(s \mid t, u) \le I(s \mid t)$.

Lemma (Chain rule [Steudel et al., 2010])
$\forall s, t, u, x \in \Omega$, $I(s : t \vee u \mid x) = I(s : t \mid x) + I(s : u \mid t, x)$.

Lemma (Data processing inequality [Steudel et al., 2010])
$\forall s, t, x \in \Omega$, $R(s \mid t) = 0 \implies I(s : x \mid t) = 0 \implies I(s : x) \le I(t : x)$.
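The first item of the non-negativity lemma can be recovered from the axioms alone. The following is a short sketch of that argument, using only submodularity and monotonicity as stated in the definition of an information measure, and the lattice fact that $u \le (s \vee u) \wedge (t \vee u)$:

```latex
\begin{align*}
R(s \vee u) + R(t \vee u)
  &\ge R\big((s \vee u) \vee (t \vee u)\big) + R\big((s \vee u) \wedge (t \vee u)\big)
     && \text{(submodularity)} \\
  &= R(s \vee t \vee u) + R\big((s \vee u) \wedge (t \vee u)\big) \\
  &\ge R(s \vee t \vee u) + R(u)
     && \text{(monotonicity)}
\end{align*}
```

Moving the right-hand side to the left gives $I(s : t \mid u) = R(s \vee u) + R(t \vee u) - R(s \vee t \vee u) - R(u) \ge 0$.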

Examples of information measures [Steudel et al., 2010]

Common examples:
- Shannon entropy of r.v.;
- Kolmogorov complexity of binary strings;
- Period length of time series;
- Size of vocabulary in a text.

Complexity/Compression-based:
- Lempel-Ziv complexity (LZ76);
- Grammar-based compression;
- LZ77? Ziv-Merhav? (Now, that's a cliffhanger!)

Section: SALZA similarity as information (SALZA factorizations of sequences; Taking similarity into account; SALZA measures of similarity; S✶ (LZ77) and S+✶ (Ziv-Merhav) complexities as information measures; Symmetry and positivity of S_f in practice)

Sequences

Definition (Sequences)
A sequence $x$ is defined as a finite succession of symbols drawn from a countable alphabet $A$. Let $|x|$ be the length of the sequence $x$; the empty sequence is $\emptyset$. $A^{+}$ is the set of all non-empty sequences and $A^{\star} = \emptyset \cup A^{+}$.
In a set of $n$ sequences $x_1, \dots, x_n$, the first $k$ sequences are denoted by $x_{\le k}$, and $x_{\le 0} = \emptyset$.

SALZA factorizations

Definition (Prior knowledge $R$ and factorizations)
Given sequences $y, x_1, \dots, x_n \in A^{\star}$, the notation $y \wr x_1, \dots, x_n$ stands for the generic case and denotes any of the following canonical factorizations:
1. $y \mid x_1, \dots, x_n$: $R$ is the past of $y$ and the entirety of $x_1, \dots, x_n$ → LZ77-based factorization;
2. $y \mid^{+} x_1, \dots, x_n$: $R$ is the entirety of $x_1, \dots, x_n$ → Ziv-Merhav-based factorization.

SALZA factorizations in picture

[Figure: the sequence $y$, split into its past (already factorized) and the part still to be factorized, shown above the sequences $x_1, x_2, x_3, \dots, x_n$. Two regions are marked: $R$ for $y \mid x_1, \dots, x_n$ (LZ77) and $R$ for $y \mid^{+} x_1, \dots, x_n$ (Ziv-Merhav).]

SALZA symbols (and lengths)

Definition (SALZA symbols $(s, l, z)$ and their lengths $L_{y \wr x_1, \dots, x_n}$)
By always finding the next longest subsequence in $R$, SALZA computes a factorization of $y$ into $m$ symbols $(s_i, l_i, z_i)_{1 \le i \le m}$:
$y \wr x_1, \dots, x_n = (s_1, l_1, z_1) \dots (s_m, l_m, z_m)$.
- Literals: $s = y$, $l = 1$ and $z$ is the symbol in $A$ that should be copied to the output buffer;
- References: $l > 1$ is the length of a subsequence in $R$.
SALZA symbol lengths are collected into:
$L_{y \wr x_1, \dots, x_n} = \{ l_i \}_{1 \le i \le m}$.
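The greedy longest-match factorization described above can be sketched in a few lines. This is a naive illustration, not the SALZA implementation: the symbol encoding is an assumption (a literal is `("y", 1, symbol)`, a reference is `(source, length, offset)` with `source` being `"y"` or the index of a prior sequence), matches into the past of `y` are not allowed to overlap the current position, and ties are broken in favour of the earliest source. The `use_past` flag switches between the LZ77-based and the Ziv-Merhav-based choice of $R$.

```python
def longest_match(target, pos, source):
    """Longest prefix of target[pos:] occurring in source: (length, offset)."""
    best_len, best_off = 0, -1
    max_len = len(target) - pos
    for off in range(len(source)):
        l = 0
        while (l < max_len and off + l < len(source)
               and source[off + l] == target[pos + l]):
            l += 1
        if l > best_len:
            best_len, best_off = l, off
    return best_len, best_off

def factorize(y, priors, use_past=True):
    """Greedy factorization of y into (s, l, z) symbols (assumed encoding)."""
    symbols, pos = [], 0
    while pos < len(y):
        best_len, best_src, best_off = 0, None, -1
        for i, x in enumerate(priors):          # R always contains x_1..x_n
            l, off = longest_match(y, pos, x)
            if l > best_len:
                best_len, best_src, best_off = l, i, off
        if use_past:                            # LZ77 case: add the past of y
            l, off = longest_match(y, pos, y[:pos])
            if l > best_len:
                best_len, best_src, best_off = l, "y", off
        if best_len > 1:                        # reference: l > 1
            symbols.append((best_src, best_len, best_off))
            pos += best_len
        else:                                   # literal: l = 1, z is the symbol
            symbols.append(("y", 1, y[pos]))
            pos += 1
    return symbols

def decode(symbols, priors):
    """Rebuild y from its symbols, to check the factorization is lossless."""
    out = []
    for src, l, z in symbols:
        if l == 1:
            out.append(z)
        else:
            ref = "".join(out) if src == "y" else priors[src]
            out.append(ref[z:z + l])
    return "".join(out)

lz77_like = factorize("abababc", [], use_past=True)
zm_like = factorize("abcabc", ["bcab"], use_past=False)
print([l for _, l, _ in lz77_like])  # the collected lengths L
print(zm_like)
```

The list of lengths printed for each factorization plays the role of $L_{y \wr x_1, \dots, x_n}$.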

Product of SALZA factorizations

Definition (Product of SALZA factorizations)
Let $y_1$ and $y_2$ be two sequences being factorized, with respective prior knowledge sequences $x_{1,1}, \dots, x_{1,n_1}$ and $x_{2,1}, \dots, x_{2,n_2}$. Let also:
$y_1 \wr x_{1,1}, \dots, x_{1,n_1} = (s_{1,1}, l_{1,1}, z_{1,1}) \dots (s_{1,m_1}, l_{1,m_1}, z_{1,m_1})$,
and
$y_2 \wr x_{2,1}, \dots, x_{2,n_2} = (s_{2,1}, l_{2,1}, z_{2,1}) \dots (s_{2,m_2}, l_{2,m_2}, z_{2,m_2})$.
We define their factorization product as the concatenation of their SALZA symbols:
$y_1 \wr x_{1,1}, \dots, x_{1,n_1} \times y_2 \wr x_{2,1}, \dots, x_{2,n_2} = (s_{1,1}, l_{1,1}, z_{1,1}) \dots (s_{1,m_1}, l_{1,m_1}, z_{1,m_1}) \, (s_{2,1}, l_{2,1}, z_{2,1}) \dots (s_{2,m_2}, l_{2,m_2}, z_{2,m_2})$.

SALZA joint and LZ77 factorizations

Definition (SALZA joint and LZ77 factorizations)
By convention, set $x_{\le 0} = \emptyset$. The joint factorization of $x_1, \dots, x_n \in A^{\star}$ is defined as the following product of factorizations:
$x_1 \cdot \ldots \cdot x_n = \prod_{i=1}^{n} x_i \wr x_{\le i-1}$.
Hence, $x_1 \mid \emptyset$ denotes the usual LZ77 factorization of $x_1$. Moreover, $x_1 \mid^{+} \emptyset$ denotes the succession of symbols forming $x_1$.

On asymmetry
Note that in general, $x \cdot y \ne y \cdot x$. On sequences, we are limited (?) to asymmetric relationships, see [Steudel et al., 2010].
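The joint factorization and its asymmetry can be illustrated on toy strings. The sketch below tracks only the symbol lengths and relies on a deliberately naive substring search; the function names, the non-overlapping treatment of the past, and the example strings are assumptions made for illustration, not part of SALZA itself.

```python
def factor_lengths(y, priors, use_past):
    """Lengths of the greedy factorization of y given prior sequences.

    A match of length 0 or 1 is emitted as a literal (l = 1); longer
    matches are references. With use_past=True the already factorized
    part of y is searched as well (LZ77-based); otherwise only the
    priors are searched (Ziv-Merhav-based).
    """
    lengths, pos = [], 0
    while pos < len(y):
        sources = list(priors) + ([y[:pos]] if use_past else [])
        l = 0
        # Grow the match while some source still contains the extended factor.
        while pos + l < len(y) and any(y[pos:pos + l + 1] in s for s in sources):
            l += 1
        l = max(l, 1)
        lengths.append(l)
        pos += l
    return lengths

def joint_lengths(xs, use_past):
    """Concatenate the factorizations of x_i given x_1..x_{i-1} (the product)."""
    out = []
    for i, x in enumerate(xs):
        out.extend(factor_lengths(x, xs[:i], use_past))
    return out

x, y = "abc", "abcabc"
print(joint_lengths([x, y], use_past=False))  # x . y, Ziv-Merhav-based
print(joint_lengths([y, x], use_past=False))  # y . x, Ziv-Merhav-based
print(factor_lengths(y, [], use_past=True))   # y | empty, LZ77-like
print(factor_lengths(y, [], use_past=False))  # y |+ empty, all literals
```

In this toy case the two orders yield a different number of symbols, and factorizing with an empty prior and no past returns one literal per symbol of the sequence.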

Rationale

Noisy-stemming hypothesis [Cancedda et al., 2003]
"Multiple word matching really does occur and is beneficial in forming discriminant, high weight features."

Sequence compressibility [Raskhodnikova et al., 2013]
Compressibility of a sequence using LZ77 is an inverse function of its $\ell$-th subword complexity, for small $\ell$.
The higher the number of small subsequences to be compressed (noise), the lower the discriminative power of compressors.

Morphological normalization in SALZA
We shall penalize small subsequence lengths in the factorizations.
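To make the compressibility remark concrete, the sketch below counts distinct subwords, assuming the common definition of the $\ell$-th subword complexity as the number of distinct contiguous subsequences of length $\ell$. The function name and the example strings are hypothetical.

```python
def subword_complexity(seq, ell):
    """Number of distinct contiguous subsequences of length ell in seq."""
    return len({seq[i:i + ell] for i in range(len(seq) - ell + 1)})

repetitive = "abababab"
varied = "abcdefgh"
for ell in (1, 2, 3):
    print(ell, subword_complexity(repetitive, ell), subword_complexity(varied, ell))
```

The repetitive string has few distinct short subwords and is the one an LZ77-style factorization would cover with long references; the varied string has many and would be factorized mostly into literals.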
