SLIDE 1 A bottom-up efficient algorithm learning substitutable languages from positive examples
François Coste, Gaelle Garet, Jacques Nicolas
ICGI, Kyoto, September 17, 2014
ReG.*iS ICGI’14 1 / 26
SLIDE 2 Motivation
The Distributional Hypothesis (words that occur in the same contexts tend to have similar meanings [Harris, 1954]; "a word is characterized by the company it keeps" [Firth, 1957]) has long been an influential idea in linguistics:
Part of the language acquisition discussion...
Basis of statistical semantics
Unsupervised POS parsing (Constituent-Context Models [Klein & Manning, 2001]...)
Learning expressive grammars from positive examples only:
Heuristics: EMILE [Adriaans, 1992; Adriaans and Vervoort, 2002], ABL [van Zaanen, 2002], ADIOS [Solan et al., 2005]...
Characterizable inference of substitutable languages: [Clark & Eyraud, 2007; Yoshinaka, 2008; ...]
and [CGN, 2012] for proteins!
SLIDE 12 Substitutable Languages
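L is zero-substitutable [Clark & Eyraud, 2007] if, for all $x_1, y_1, z_1, x_2, y_2, z_2 \in \Sigma^*$ with $y_1, y_2 \neq \lambda$:
$x_1 y_1 z_1 \in L \wedge x_1 y_2 z_1 \in L \Rightarrow (x_2 y_1 z_2 \in L \Leftrightarrow x_2 y_2 z_2 \in L)$, i.e. $[y_1] = [y_2]$

L is k,l-local substitutable [CGN, 2012] if, for all $x_1, y_1, z_1, x_2, y_2, z_2, x_3, z_3 \in \Sigma^*$, $u \in \Sigma^k$, $v \in \Sigma^l$ with $u y_1 v, u y_2 v \neq \lambda$:
$x_1 u y_1 v z_1 \in L \wedge x_3 u y_2 v z_3 \in L \Rightarrow (x_2 y_1 z_2 \in L \Leftrightarrow x_2 y_2 z_2 \in L)$, i.e. $[y_1] = [y_2]$

L is k,l-context-substitutable [Yoshinaka, 2008] if, for all $x_1, y_1, z_1, x_2, y_2, z_2 \in \Sigma^*$, $u \in \Sigma^k$, $v \in \Sigma^l$ with $u y_1 v, u y_2 v \neq \lambda$:
$x_1 u y_1 v z_1 \in L \wedge x_1 u y_2 v z_1 \in L \Rightarrow (x_2 u y_1 v z_2 \in L \Leftrightarrow x_2 u y_2 v z_2 \in L)$, i.e. $[u y_1 v] = [u y_2 v]$

L is k,l-local-context substitutable [CGN, 2012] if, for all $x_1, y_1, z_1, x_2, y_2, z_2, x_3, z_3 \in \Sigma^*$, $u \in \Sigma^k$, $v \in \Sigma^l$ with $u y_1 v, u y_2 v \neq \lambda$:
$x_1 u y_1 v z_1 \in L \wedge x_3 u y_2 v z_3 \in L \Rightarrow (x_2 u y_1 v z_2 \in L \Leftrightarrow x_2 u y_2 v z_2 \in L)$, i.e. $[u y_1 v] = [u y_2 v]$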
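The first definition can be checked mechanically when the language is finite. The sketch below is only an illustration, not part of the talk: the function names `contexts` and `is_substitutable` are hypothetical, symbols are single characters, and the given set of strings is assumed to be the whole language L. It collects the contexts (x, z) of every non-empty substring y and verifies that two substrings sharing one context share all of them.

```python
from collections import defaultdict
from itertools import combinations


def contexts(language):
    """Map each non-empty substring y to its set of contexts (x, z) with xyz in L."""
    ctx = defaultdict(set)
    for w in language:
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                ctx[w[i:j]].add((w[:i], w[j:]))
    return ctx


def is_substitutable(language):
    """Check the (zero-)substitutability condition on a finite language."""
    ctx = contexts(language)
    for y1, y2 in combinations(ctx, 2):
        # One shared context must imply identical context sets.
        if ctx[y1] & ctx[y2] and ctx[y1] != ctx[y2]:
            return False
    return True
```

For instance, `{"ab", "cb", "ad"}` fails the check because `a` and `c` share the context `("", "b")` while only `a` occurs before `d`; adding `"cd"` restores the property.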
SLIDE 16 “Weak-implies-Strong” Generalization
Let K be the following set of strings:
Major General was here yesterday morning.
Major General went here yesterday morning.
Major General will be there tomorrow morning.
He will be gone tomorrow evening.
Strings to add to get a 1,1-local substitutable language:
Major General will be gone tomorrow morning.
He will be there tomorrow evening.
He will be gone tomorrow morning.
Major General will be there tomorrow evening.
Major General was here yesterday evening.
Major General went here yesterday evening.
Major General will be gone tomorrow evening.
He will be there tomorrow morning.
...
SLIDE 17 Learning k, l-local substitutable languages
Adaptation of SGL algorithm [Clark & Eyraud, 2007] :
SLIDE 18
SGLLS
(Substitution Graph Learner for Local Substitutable languages)
Input: Set of sequences K on alphabet Σ, int k, int l
Output: Grammar G = ⟨Σ, N_K, S_K, P_K⟩
/* Partition Sub(K) in Local Substitutability classes */
C_K ← LS_classes(K, k, l)
/* ∀y ∈ Sub(K), C_K(y) = C ∈ C_K : y ∈ C */
/* Build grammar */
N_K ← ∅, P_K ← ∅
for C ∈ C_K do
    /* A non-terminal for each substitutability class */
    N_K ← N_K ∪ {C}
    /* Production rules for each substring in the class */
    for y ∈ C do
        if |y| > 1 then
            /* Branching rules: a 'CNF' rule for each split */
            for y1 ∈ Σ+, y2 ∈ Σ+ : y1y2 = y do
                P_K ← P_K ∪ {C → C_K(y1) C_K(y2)}
        else
            /* Terminal rule */
            P_K ← P_K ∪ {C → y}
S_K ← {C_K(w) : w ∈ K}
return ⟨Σ, N_K, S_K, P_K⟩
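The grammar-building part of the pseudocode above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the name `build_grammar` is hypothetical, strings are tuples of symbols, classes are identified by their index in the list, and the partition is assumed to cover every substring of the sample (as Sub(K) does).

```python
def build_grammar(classes, sample):
    """Build the SGLLS-style productions from a partition of Sub(K).

    classes: list of sets of tuples (one set per substitutability class).
    sample:  iterable of tuples (the strings of K).
    Returns (productions, start_symbols); non-terminals are class indices.
    """
    class_of = {y: idx for idx, cls in enumerate(classes) for y in cls}
    productions = set()
    for idx, cls in enumerate(classes):
        for y in cls:
            if len(y) > 1:
                # Branching rules: one binary rule for each split of y.
                for cut in range(1, len(y)):
                    productions.add((idx, (class_of[y[:cut]], class_of[y[cut:]])))
            else:
                # Terminal rule.
                productions.add((idx, y[0]))
    start = {class_of[w] for w in sample}
    return productions, start


# Tiny hand-made partition for the single string "a b".
classes = [{("a",)}, {("b",)}, {("a", "b")}]
productions, start = build_grammar(classes, [("a", "b")])
```

Every substring of every class contributes one rule per split, which is where the large number of redundant rules in the resulting grammar comes from.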
SLIDE 19 Local substitutability classes
LS_classes()
Input: Set of sequences K on alphabet Σ, int k, int l
Output: k,l-local substitutability classes on K
/* Build substitutability graph on substrings */
V ← {y ∈ Σ+ : y ∈ Sub(K)}
E ← {{y1, y2} ∈ V × V : uy1v ∈ Sub(K), uy2v ∈ Sub(K), y1 ≠ y2, u ∈ Σ^k, v ∈ Σ^l}
/* Return connected components of graph */
return Connected_components(V, E)
where Sub(K) denotes the set of substrings of K.
This is the only difference from SGL! To learn other substitutable classes, change the initialization of E and V.
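A minimal Python sketch of LS_classes() is given below. It is an assumption-laden illustration rather than the authors' code: the names `substrings` and `ls_classes` are hypothetical, strings are tuples of tokens, the final period is treated as an ordinary token so that sentence-final words have a right context, and connected components are computed with a small union-find instead of an explicit graph.

```python
from collections import defaultdict


def substrings(sample):
    """Sub(K): all non-empty contiguous substrings of the sample."""
    return {w[i:j] for w in sample
            for i in range(len(w)) for j in range(i + 1, len(w) + 1)}


def ls_classes(sample, k, l):
    """Connected components of the k,l-local substitutability graph."""
    subs = substrings(sample)
    parent = {y: y for y in subs}

    def find(y):
        while parent[y] != y:
            parent[y] = parent[parent[y]]  # path halving
            y = parent[y]
        return y

    # Group every y by the local context (u, v) such that u y v is in Sub(K).
    by_context = defaultdict(list)
    for s in subs:
        if len(s) > k + l:
            u, y, v = s[:k], s[k:len(s) - l], s[len(s) - l:]
            by_context[(u, v)].append(y)
    # Substrings sharing a local context are linked by an edge.
    for ys in by_context.values():
        for y in ys[1:]:
            parent[find(y)] = find(ys[0])

    classes = defaultdict(set)
    for y in subs:
        classes[find(y)].add(y)
    return list(classes.values())


K = [tuple(s.split()) for s in (
    "Major General was here yesterday morning .",
    "Major General went here yesterday morning .",
    "Major General will be there tomorrow morning .",
    "He will be gone tomorrow evening .",
)]
class_of = {y: c for c in ls_classes(K, 1, 1) for y in c}
```

With k = l = 1, `there` and `gone` fall in the same class through the shared context (`be`, `tomorrow`), while `here` stays alone in its class.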
SLIDE 20 Substitutability graph
[Figure: substitutability graph on the substrings of K, one vertex per substring, from single words ('be', 'Major', 'General', 'was', ...) up to whole sentences ('He will be gone tomorrow evening', ...). Source: Gaelle Garet]
SLIDE 22 Resulting grammar :
X47 → 'yesterday'
X46 → X3 X43 | X11 X19 | X23 X13
X45 → X20 X2
X44 → X20 X1 | X45 X29 | X34 X16 | X9 X15
X43 → X20 X19 | X45 X13
X42 → 'tomorrow'
X41 → X42 X15
X40 → X30 X46 | X21 X43 | X39 X19 | X25 X13 | X21 X34 | X8 X13
N0 → X30 X24 | X21 X31 | X10 X32 | X36 X17 | X26 X15 | X39 X1 | X25 X29 | X40 X41 | X21 X44 | X8 X29 | X40 X16 | X33 X15
X29 → X13 X16 | X4 X15 | X13 X41 | X38 X15
X28 → X27 X47
X25 → X30 X23 | X21 X45 | X39 X2
X24 → X3 X31 | X18 X32 | X37 X17 | X14 X15 | X11 X1 | X23 X29 | X46 X41
X27 → 'here'
X26 → X30 X14 | X21 X12 | X10 X28 | X36 X47 | X39 X35 | X25 X38 | X40 X42
X21 → 'He' | X30 X3
X20 → 'will'
X23 → X3 X45 | X11 X2
X22 → X6 X27
X8 → X21 X45 | X39 X2
X9 → X20 X5 | X45 X4 | X34 X42
X2 → 'be'
X3 → 'Major'
X1 → X2 X29 | X19 X16 | X5 X15 | X19 X41 | X35 X15
X6 → 'was' | 'went'
X4 → X13 X42
X5 → X2 X4 | X19 X42
X32 → X27 X17 | X28 X15
X33 → X21 X9 | X39 X5 | X8 X4 | X40 X42
X30 → 'General'
X31 → X6 X32 | X22 X17 | X12 X15 | X20 X1 | X45 X29 | X43 X41
X36 → X30 X37 | X21 X22 | X10 X27
X37 → X3 X22 | X18 X27
X34 → X20 X19 | X45 X13
X35 → X2 X38 | X19 X42
X38 → X13 X42
X39 → X30 X11 | X21 X20
X18 → X3 X6
X19 → X2 X13
X10 → X30 X18 | X21 X6
X11 → X3 X20
X12 → X6 X28 | X22 X47 | X20 X35 | X45 X38 | X43 X42
X13 → 'there' | 'gone'
X14 → X3 X12 | X18 X28 | X37 X47 | X11 X35 | X23 X38 | X46 X42
X15 → 'morning' | 'evening'
X16 → X42 X15
X17 → X47 X15
source: Gaelle Garet
SLIDE 23 Limitations
A lot of non-terminals and rules
A lot of redundancy and ambiguity
⇒ Parsing time and learning time problems (about a day for one experiment on a simple set of proteins), and an illegible grammar
A solution
Reduce the grammar before parsing
Reduce the grammar during the inference
SLIDE 26 Ideas
1. Avoid unnecessary derivations by removing non-terminals with a deterministic derivation: if A is the left-hand side of only one rule A → α, then delete A and its rule, and replace every B → ...A... by B → ...α....
2. Reduce the right-hand sides of the production rules: for each N → ...β..., if there exists α such that |α| < |β| ∧ α ⇒ β, then replace N → ...β... by N → ...α... (Teaser: it also ensures maximal generalization!)
SLIDE 27
1. Avoid unnecessary derivations
Keep prime classes
Recall: each non-terminal corresponds to a substitutability congruence class. Slight abuse of notation: let [x] denote the class of a non-terminal or terminal x.
If N is deterministically derived by N → α = α_1 ... α_|α|, then [N] = [α_1]...[α_|α|]. We call such a useless class a composite class (these are the classes giving rise to vacuous local derivation trees [Clark, 2011]). A class is prime if it is not composite. We have:
Primality test
Let L be a language whose set of non-zero and non-unit congruence classes is C+.
A class [y] in C+ is prime for L iff ∀y1y2 ∈ [y], [y] ⊈ [y1][y2].
This test is sufficient since, for the syntactic congruence, we have [y1y2] ⊇ [y1][y2].
SLIDE 28 K-Primes
Due to the monotonicity of generalization, it is safe to filter out composite classes on the basis of the sample K. We introduce the function Primes() for that purpose:
Primes()
Input: Set of substitutability classes C_K
Output: Set P of substitutability classes satisfying the primality test in C_K
P ← ∅
for C ∈ C_K do
    if (∄C′ ∈ C_K : ∀y ∈ C, ∃y′ ∈ C′, ∃v ∈ Σ+, y = y′v) and (∄C′ ∈ C_K : ∀y ∈ C, ∃y′ ∈ C′, ∃u ∈ Σ+, y = uy′) then
        P ← P ∪ {C}
return P
The primality test is performed on K, not on L! But it works well in practice, and when the sample is informative enough.
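The filter described by Primes() can be sketched as follows. This is a hedged illustration, not the authors' code: the name `primes` and the helper names are hypothetical, strings are tuples of symbols, and the test is the sample-based one from the slide (a class is dropped when all its members have a proper prefix in one single class, or all have a proper suffix in one single class).

```python
def primes(classes):
    """Keep the classes that pass the sample-based primality test."""

    def proper_prefixes(y):
        return [y[:i] for i in range(1, len(y))]

    def proper_suffixes(y):
        return [y[i:] for i in range(1, len(y))]

    def covered(cls, other, parts):
        # True if every y in cls has one of its proper parts in `other`.
        return all(any(p in other for p in parts(y)) for y in cls)

    kept = []
    for cls in classes:
        by_prefix = any(covered(cls, other, proper_prefixes) for other in classes)
        by_suffix = any(covered(cls, other, proper_suffixes) for other in classes)
        if not by_prefix and not by_suffix:
            kept.append(cls)
    return kept


classes = [
    {("a",)},
    {("b",)},
    {("a", "b")},              # every member starts with a member of class 0
    {("a", "b", "c"), ("d",)},  # ("d",) has no proper prefix or suffix
]
kept = primes(classes)
```

Under this test a class containing a single-symbol string is always kept, since that string has no proper prefix or suffix.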
SLIDE 29 On the example
[Figure: substitutability graph on the substrings of K, one vertex per substring. Source: Gaelle Garet]
SLIDE 30 On the example
[Figure: the graph restricted to the following substrings: 'be', 'Major', 'was', 'General Major was here yesterday morning', 'General Major', 'will be there tomorrow', 'there', 'General', 'will', 'General Major went here yesterday morning', 'He will be gone tomorrow evening', 'here', 'gone tomorrow evening', 'went', 'morning', 'evening', 'gone', 'He will be gone', 'there tomorrow morning', 'General Major will be there tomorrow morning', 'General Major will be there', 'tomorrow', 'was here yesterday', 'went here yesterday', 'He', 'yesterday']
SLIDE 31 On the example
[Figure: the remaining classes, labeled with non-terminals:
X1 = {'was', 'went'}
X2 = {'morning', 'evening'}
X3 = {'He', 'General Major'}
X4 = {'will be there tomorrow', 'was here yesterday', 'went here yesterday'}
X5 = {'there', 'gone'}
S = {'General Major was here yesterday morning', 'General Major went here yesterday morning', 'General Major will be there tomorrow morning', 'He will be gone tomorrow evening'}]
SLIDE 32
ReGLiS Part1 (Learning Reduced Grammar by k,l-Local Substitutability)
simplified!
Input: Set of sequences K on alphabet Σ, int k, int l
Output: Grammar G = ⟨Σ, N_K, S_K, P_K⟩
/* Prime substitutability classes on K */
C_K ← Primes(LS_classes(K, k, l))
/* Build initial grammar */
N_K ← ∅, P_K ← ∅
for C ∈ C_K do
    /* A non-terminal for each K-prime */
    N_K ← N_K ∪ {C}
    /* A direct production for each substring in the class */
    for y ∈ C do
        P_K ← P_K ∪ {C → y}
/* So far, we have built the initial grammar */
SLIDE 34 Initial ’bottom’ grammar
G = ⟨Σ, V_K, P, S⟩
Σ = { General, Major, will, be, gone, there, was, went, He, here, tomorrow, yesterday, morning, evening }
V_K = { S, X1, X2, X3, X4, X5 }
P = {
S → General Major will be there tomorrow morning | General Major was here yesterday morning | General Major went here yesterday morning | He will be gone tomorrow evening
X1 → was | went
X2 → morning | evening
X3 → He | General Major
X4 → will be there tomorrow | was here yesterday | went here yesterday
X5 → there | gone
}
A non-terminal for each K-prime
L(G) = K (no language generalization)
For each non-terminal N, L(G, N) = [N]
SLIDE 35
2. Reduce right-hand sides
And generalize at once
We 'know' the interesting substitutability classes (but only part of their content).
If a substring from a substitutability class is present in a right-hand side, we have to replace this occurrence by the non-terminal of the class (Weak-implies-Strong generalization):
from N1 → ...β... and N2 ⇒* β, infer N1 → ...N2... (and N2 ⇒* β)
Take care of overlapping occurrences.
Don't keep or introduce redundant rules; introduce only the most general ones.
Some kinship with Minimal Grammar Parsing for the smallest grammar problem [Carrascosa et al., 2011; Gallé, 2011], but with more than one substring per non-terminal.
SLIDE 36 Example
Let us consider that so far the grammar is such that P = { ... ; N → abcde | ... ; N1 → bcd | ... ; N2 → cde | ... ; N3 → ab | ... ; N4 → de | ... ; ... } (where each of a, b, c, d, e is a terminal or non-terminal symbol).
The parsing graph for N → abcde is then:
[Figure: parsing graph on vertices 0 to 5, with edges a (0→1), b (1→2), c (2→3), d (3→4), e (4→5), N3 (0→2), N1 (1→4), N2 (2→5), N4 (3→5)]
The path 0-1-4-5 gives N → aN1e.
The path 0-2-3-5 (N3cN4) can be reduced: 0-2-5 is a strict subsequence of 0-2-3-5. (Typo in the final version of the paper: on p. 7, replace 'strict substring' by 'strict subsequence'.)
The path 0-2-5 gives N3N2, hence N → aN1e | N3N2.
Implemented by dynamic programming on vertices in the function Non_redundant_rhs().
SLIDE 41
ReGLiS Part2
simplified!
/* Generalization */
repeat
    P ← P_K ; P_K ← ∅
    /* Branching rules */
    for (C → α) ∈ P ordered by increasing |α| do
        PG ← Build_parsing_graph(α, P)
        for β ∈ Non_redundant_rhs(PG) do
            P_K ← P_K ∪ {C → β}
until P_K = P
S_K ← {C_K(w) : w ∈ K}
return ⟨Σ, N_K, S_K, P_K⟩
Build_parsing_graph
Input: Sequence α, Set of rules P
Output: Parsing graph ⟨V, E⟩
V ← {i ∈ [0, |α|]} /* vertices */ ; E ← ∅ /* labeled directed edges */
for i ∈ V do
    for j ∈ V : i < j and (i, j) ≠ (0, |α|) do
        if ∃(C → α[i + 1, j]) ∈ P then
            E ← E ∪ {(i, j, C)}
return ⟨V, E⟩
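The generalization loop can be sketched in Python as below. This is a simplified illustration under explicit assumptions, not the authors' implementation: all function names are hypothetical, rules are pairs (left-hand side, tuple of symbols), each right-hand side is assumed to belong to at most one non-terminal, a span of length one is labeled by its non-terminal when one exists, and non-redundant paths are found by brute-force enumeration instead of the dynamic programming mentioned in the talk.

```python
def build_parsing_graph(alpha, rules):
    """Edges (i, j) -> label over positions 0..len(alpha)."""
    n = len(alpha)
    edges = {(i, i + 1): alpha[i] for i in range(n)}  # one edge per symbol
    for lhs, rhs in rules:
        m = len(rhs)
        for i in range(n - m + 1):
            # The full span (0, n) is excluded, as in the pseudocode.
            if (i, i + m) != (0, n) and alpha[i:i + m] == rhs:
                edges[(i, i + m)] = lhs
    return edges


def non_redundant_rhs(alpha, edges):
    """Label sequences of the paths 0..n having no strict sub-path."""
    n = len(alpha)
    paths = []

    def walk(path):
        if path[-1] == n:
            paths.append(path)
            return
        for (i, j) in edges:
            if i == path[-1]:
                walk(path + (j,))

    walk((0,))
    vertex_sets = [set(p) for p in paths]
    # Vertices increase along a path, so "strict subsequence" is "strict subset".
    minimal = [p for p, s in zip(paths, vertex_sets)
               if not any(t < s for t in vertex_sets)]
    return {tuple(edges[a, b] for a, b in zip(p, p[1:])) for p in minimal}


def generalize(rules):
    """Repeat the reduction of right-hand sides until a fixpoint is reached."""
    rules = set(rules)
    while True:
        new_rules = set()
        for lhs, alpha in sorted(rules, key=lambda r: len(r[1])):
            edges = build_parsing_graph(alpha, rules)
            for beta in non_redundant_rhs(alpha, edges):
                new_rules.add((lhs, beta))
        if new_rules == rules:
            return rules
        rules = new_rules


def rule(lhs, rhs):
    return (lhs, tuple(rhs.split()))


bottom = {
    rule("S", "General Major will be there tomorrow morning"),
    rule("S", "General Major was here yesterday morning"),
    rule("S", "General Major went here yesterday morning"),
    rule("S", "He will be gone tomorrow evening"),
    rule("X1", "was"), rule("X1", "went"),
    rule("X2", "morning"), rule("X2", "evening"),
    rule("X3", "He"), rule("X3", "General Major"),
    rule("X4", "will be there tomorrow"),
    rule("X4", "was here yesterday"), rule("X4", "went here yesterday"),
    rule("X5", "there"), rule("X5", "gone"),
}
final = generalize(bottom)
```

On this input the sentence 'He will be gone tomorrow evening' is only reduced to X3 X4 X2 at the second pass, once X4 → will be X5 tomorrow has been produced by the first pass, which illustrates why the outer loop is repeated until nothing changes.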
SLIDE 43 Final Grammar
S → X3 X4 X2
X1 → was | went
X2 → morning | evening
X3 → He | General Major
X4 → will be X5 tomorrow | X1 here yesterday
X5 → there | gone
SLIDE 44 Complexity
Complexity: $O(\max(l^3, l \cdot t))$
l: size of the longest sequence
t: size of the target grammar
Run time comparison between the old (SGLLS) and new (ReGLiS) learning algorithm implementations:
[Figure: run time (seconds) of ReGLiS and SGLLS with respect to the number of strings in the training sample]
[Figure: run time (seconds) of ReGLiS and SGLLS with respect to the length of strings in the training sample]
For our protein experiments: from a day to a few minutes
SLIDE 45 Protein experiments (10-fold cross-validation)
                             Zinc finger                     MPI phos.
                             Precision  Recall  F-measure    Precision  Recall  F-measure
Subst.                       1          0.1     0.36         1          0.15    0.26
3,3-LS                       1          0.2     0.33         1          0.5     0.67
4,4-LS                       1          0.25    0.4          1          0.6     0.75
5,5-LS                       1          0.33    0.5          1          0.67    0.8
6,6-LS                       1          0.5     0.67         1          0.62    0.77
7,7-LS                       1          0.55    0.7          1          0.53    0.69
SCFG [Dyrka & Nebel, 2009]   1          0.1     0.18         1          0.3     0.46
SCFG [Dyrka & Nebel, 2009]   0.15       1       0.26         0.5        1       0.67
SCFG [Dyrka & Nebel, 2009]   0.75       0.87    0.85         0.98       0.89    0.93

                             PS00219                         PS00063
                             Precision  Recall  F-measure    Precision  Recall  F-measure
Subst.                       1          0.2     0.33         1          0.23    0.37
3,3-LS                       1          0.72    0.84         1          0.58    0.73
4,4-LS                       1          0.7     0.82         1          0.6     0.75
5,5-LS                       1          0.68    0.8          1          0.66    0.8
6,6-LS                       1          0.6     0.75         1          0.7     0.82
7,7-LS                       1          0.5     0.67         1          0.65    0.79
Prosite                      1          0.6     0.75         1          0.8     0.89
SCFG [Dyrka & Nebel, 2009]   0.05 0.1 1 0.18 1 1 1 0.79 0.65 0.71

Table: Sequence class prediction by grammars obtained for different families
SLIDE 46 Conclusion
ReGLiS (the ReG.*iS family: ReGiS, ReGCis, ReGLiS, ReGLCiS):
Bottom-up generalization from an initial grammar
Efficient by dynamic programming on the parsing graph
No parsing required, iterative
Practical algorithm:
Faster inference
Faster parsing (non-redundant minimal grammar)
Reduced grammar:
Easier to read
Canonical form! cf. [Clark, 2013] → polynomial identification in the limit property (cf. Remi's talk yesterday)
Confirmation of good results on proteins (with some preprocessing but no statistical parameters)
Another step towards practical application of (local-)substitutability
SLIDE 47 Perspectives
Practical
Choice of initial classes: data-driven heuristics
Grammar weighting for biological sequences
Better understand why (local-)substitutability seems pertinent for biological sequences...
From prototype to efficient implementation?
Theoretical
Better understand and characterize the interest of the outer loop in generalization with respect to SGL
What happens exactly when the sample is not characteristic? Is it possible to ensure that a grammar in the target class is always returned?
And attend Gaelle's thesis (Dec 2014)