SLIDE 1

A bottom-up efficient algorithm learning substitutable languages from positive examples

François Coste, Gaelle Garet, Jacques Nicolas
ICGI, Kyoto, September 17, 2014

  • F. Coste (Dyliss, Inria)

ReG.*iS ICGI’14 1 / 26

SLIDE 2

Motivation

The Distributional Hypothesis (words that occur in the same contexts tend to have similar meanings [Harris, 1954]; "a word is characterized by the company it keeps" [Firth, 1957]) has long been an influential idea in linguistics:

  • Part of the language acquisition discussion...
  • Basis of statistical semantics
  • Unsupervised POS parsing (Constituent-Context Models [Klein & Manning, 2001]...)

Learning expressive grammars from positive examples only:

  • Heuristics: EMILE [Adriaans, 1992; Adriaans and Vervoort, 2002], ABL [van Zaanen, 2002], ADIOS [Solan et al., 2005]...
  • Characterizable inference of substitutable languages: [Clark & Eyraud, 2007; Yoshinaka, 2008; ...] and [CGN, 2012] for proteins!

SLIDE 3

Substitutable Languages

L is substitutable [Clark & Eyraud, 2007] if:

$$\forall x_1, y_1, z_1, x_2, y_2, z_2 \in \Sigma^*,\ y_1, y_2 \neq \lambda:$$

$$x_1 y_1 z_1 \in L \land x_1 y_2 z_1 \in L \land x_2 y_1 z_2 \in L \Rightarrow x_2 y_2 z_2 \in L$$
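For a finite set of strings the condition can be checked by brute force, which may help build intuition. The sketch below is an illustration only and not part of the talk: it treats a finite Python set as the whole language L, and the names `contexts` and `is_substitutable` are made up for the example. It uses the equivalent reading that two non-empty substrings sharing one context must share all of their contexts.

```python
from collections import defaultdict

def contexts(language):
    """Map each non-empty substring y to the set of contexts (x, z) with x+y+z in L."""
    ctx = defaultdict(set)
    for w in language:
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                ctx[w[i:j]].add((w[:i], w[j:]))
    return ctx

def is_substitutable(language):
    """Brute-force check on a FINITE language (assumption: L is exactly this set)."""
    ctx = contexts(language)
    subs = list(ctx)
    # If y1 and y2 share at least one context, their context sets must coincide.
    return all(ctx[a] == ctx[b] for a in subs for b in subs if ctx[a] & ctx[b])

print(is_substitutable({"ab", "cb", "ad"}))        # False: "cd" is missing
print(is_substitutable({"ab", "cb", "ad", "cd"}))  # True
```

In the first call, a and c share the context (λ, b), yet a also occurs before d while c does not, so the set is not substitutable until "cd" is added.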

SLIDE 12

Substitutable Languages

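L is zero-substitutable [Clark & Eyraud, 2007] if:

$$\forall x_1, y_1, z_1, x_2, y_2, z_2 \in \Sigma^*,\ y_1, y_2 \neq \lambda:$$

$$x_1 y_1 z_1 \in L \land x_1 y_2 z_1 \in L \Rightarrow (x_2 y_1 z_2 \in L \Leftrightarrow x_2 y_2 z_2 \in L), \quad \text{i.e. } [y_1] = [y_2]$$

L is k,l-local substitutable [CGN, 2012] if:

$$\forall x_1, y_1, z_1, x_2, y_2, z_2, x_3, z_3 \in \Sigma^*,\ u \in \Sigma^k,\ v \in \Sigma^l,\ u y_1 v, u y_2 v \neq \lambda:$$

$$x_1 u y_1 v z_1 \in L \land x_3 u y_2 v z_3 \in L \Rightarrow (x_2 y_1 z_2 \in L \Leftrightarrow x_2 y_2 z_2 \in L), \quad \text{i.e. } [y_1] = [y_2]$$

L is k,l-context-substitutable [Yoshinaka, 2008] if:

$$\forall x_1, y_1, z_1, x_2, y_2, z_2 \in \Sigma^*,\ u \in \Sigma^k,\ v \in \Sigma^l,\ u y_1 v, u y_2 v \neq \lambda:$$

$$x_1 u y_1 v z_1 \in L \land x_1 u y_2 v z_1 \in L \Rightarrow (x_2 u y_1 v z_2 \in L \Leftrightarrow x_2 u y_2 v z_2 \in L), \quad \text{i.e. } [u y_1 v] = [u y_2 v]$$

L is k,l-local-context substitutable [CGN, 2012] if:

$$\forall x_1, y_1, z_1, x_2, y_2, z_2, x_3, z_3 \in \Sigma^*,\ u \in \Sigma^k,\ v \in \Sigma^l,\ u y_1 v, u y_2 v \neq \lambda:$$

$$x_1 u y_1 v z_1 \in L \land x_3 u y_2 v z_3 \in L \Rightarrow (x_2 u y_1 v z_2 \in L \Leftrightarrow x_2 u y_2 v z_2 \in L), \quad \text{i.e. } [u y_1 v] = [u y_2 v]$$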

SLIDE 16

“Weak-implies-Strong” Generalization

Let K be the following set of strings:

  • Major General was here yesterday morning.
  • Major General went here yesterday morning.
  • Major General will be there tomorrow morning.
  • He will be gone tomorrow evening.

Strings to add to get a 1,1-local substitutable language:

  • Major General will be gone tomorrow morning.
  • He will be there tomorrow evening.
  • He will be gone tomorrow morning.
  • Major General will be there tomorrow evening.
  • Major General was here yesterday evening.
  • Major General went here yesterday evening.
  • Major General will be gone tomorrow evening.
  • He will be there tomorrow morning.
  • ...

SLIDE 17

Learning k, l-local substitutable languages

Adaptation of SGL algorithm [Clark & Eyraud, 2007] :

SLIDE 18

SGLLS

(Substitution Graph Learner for Local Substitutable languages)

Input: Set of sequences K on alphabet Σ, int k, int l
Output: Grammar G = ⟨Σ, NK, SK, PK⟩

/* Partition Sub(K) in Local Substitutability classes */
CK ← LS_classes(K, k, l)
/* ∀y ∈ Sub(K), CK(y) = C ∈ CK : y ∈ C */

/* Build grammar */
NK ← ∅, PK ← ∅
for C ∈ CK do
    /* A non-terminal for each substitutability class */
    NK ← NK ∪ {C}
    /* Production rules for each substring in the class */
    for y ∈ C do
        if |y| > 1 then
            /* Branching rules: a 'CNF' rule for each split */
            for y1 ∈ Σ+, y2 ∈ Σ+ : y1y2 = y do
                PK ← PK ∪ {C → CK(y1)CK(y2)}
        else
            /* Terminal rule */
            PK ← PK ∪ {C → y}
SK ← {CK(k) : k ∈ K}
return ⟨Σ, NK, SK, PK⟩
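The grammar-building half of this pseudocode can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the partition CK is supposed to be already computed and passed in as a hypothetical dictionary `class_of` mapping every substring of the sample to a class label, and the function name `sglls_grammar` is invented for the example.

```python
def sglls_grammar(sample, class_of):
    """Build (non-terminals, start symbols, productions) from a substring partition.

    `class_of` (hypothetical name) plays the role of CK(y): it must contain
    every non-empty substring of every string in `sample`.
    """
    nonterminals = set(class_of.values())
    productions = set()
    for y, c in class_of.items():
        if len(y) > 1:
            # Branching rules: one binary rule per way of splitting y in two.
            for cut in range(1, len(y)):
                productions.add((c, (class_of[y[:cut]], class_of[y[cut:]])))
        else:
            # Terminal rule: the class rewrites to the single symbol y.
            productions.add((c, tuple(y)))
    starts = {class_of[w] for w in sample}
    return nonterminals, starts, productions

# Toy usage with a hand-written partition (each substring in its own class).
K = ["ab"]
CK = {"a": "A", "b": "B", "ab": "S"}
N, S, P = sglls_grammar(K, CK)
print(sorted(P))
```

Every substring contributes one rule per split point, which is why the number of rules grows quickly with the length of the sample strings.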

SLIDE 19

Local substitutability classes

LS_classes()

Input: Set of sequences K on alphabet Σ, int k, int l
Output: k,l-local substitutability classes on K

/* Build substitutability graph on substrings */
V ← {y ∈ Σ+ : y ∈ Sub(K)}
E ← {{y1, y2} ∈ V × V : uy1v ∈ Sub(K), uy2v ∈ Sub(K), y1 ≠ y2, u ∈ Σ^k, v ∈ Σ^l}
/* Return connected components of graph */
return Connected_components(V, E)

where Sub(K) denotes the set of substrings of K.

This is the only difference from SGL! To learn other substitutable classes, change the initialization of E and V.
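The graph construction and its connected components can be sketched with a union-find structure. The code below is a hedged illustration: the function names are invented, sequences are represented as tuples of words, and no boundary markers are added at the start or end of a sequence (the slides do not say how sequence boundaries are handled, so that choice is an assumption).

```python
from collections import defaultdict

def substrings(sample):
    """All non-empty contiguous subsequences of the sample (Sub(K))."""
    return {w[i:j] for w in sample
            for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

def ls_classes(sample, k, l):
    """Connected components of the k,l-local substitutability graph on Sub(K)."""
    subs = substrings(sample)
    parent = {y: y for y in subs}

    def find(y):
        while parent[y] != y:
            parent[y] = parent[parent[y]]  # path halving
            y = parent[y]
        return y

    # Group the middles y of all substrings u.y.v by their local context (u, v).
    by_context = defaultdict(list)
    for s in subs:
        if len(s) > k + l:
            by_context[(s[:k], s[len(s) - l:])].append(s[k:len(s) - l])
    # Substrings seen in the same local context are linked in the graph.
    for ys in by_context.values():
        for y in ys[1:]:
            parent[find(y)] = find(ys[0])

    classes = defaultdict(set)
    for y in subs:
        classes[find(y)].add(y)
    return list(classes.values())

K = [tuple(s.split()) for s in [
    "Major General was here yesterday morning",
    "Major General went here yesterday morning",
    "Major General will be there tomorrow morning",
    "He will be gone tomorrow evening",
]]
classes = ls_classes(K, 1, 1)
print(next(c for c in classes if ("was",) in c))
```

Grouping by context avoids comparing all pairs of substrings: every group forms a clique in the graph, so linking each member to the first one yields the same components.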

SLIDE 20

Substitutability graph

[Figure: substitutability graph whose vertices are the substrings of the example strings, from single words such as 'was', 'went', 'there' and 'gone' up to whole sentences such as 'He will be gone tomorrow evening'. Source: Gaelle Garet]

SLIDE 22

Resulting grammar:

X47 → 'yesterday'
X46 → X3 X43 | X11 X19 | X23 X13
X45 → X20 X2
X44 → X20 X1 | X45 X29 | X34 X16 | X9 X15
X43 → X20 X19 | X45 X13
X42 → 'tomorrow'
X41 → X42 X15
X40 → X30 X46 | X21 X43 | X39 X19 | X25 X13 | X21 X34 | X8 X13
N0 → X30 X24 | X21 X31 | X10 X32 | X36 X17 | X26 X15 | X39 X1 | X25 X29 | X40 X41 | X21 X44 | X8 X29 | X40 X16 | X33 X15
X29 → X13 X16 | X4 X15 | X13 X41 | X38 X15
X28 → X27 X47
X25 → X30 X23 | X21 X45 | X39 X2
X24 → X3 X31 | X18 X32 | X37 X17 | X14 X15 | X11 X1 | X23 X29 | X46 X41
X27 → 'here'
X26 → X30 X14 | X21 X12 | X10 X28 | X36 X47 | X39 X35 | X25 X38 | X40 X42
X21 → 'He' | X30 X3
X20 → 'will'
X23 → X3 X45 | X11 X2
X22 → X6 X27
X8 → X21 X45 | X39 X2
X9 → X20 X5 | X45 X4 | X34 X42
X2 → 'be'
X3 → 'Major'
X1 → X2 X29 | X19 X16 | X5 X15 | X19 X41 | X35 X15
X6 → 'was' | 'went'
X4 → X13 X42
X5 → X2 X4 | X19 X42
X32 → X27 X17 | X28 X15
X33 → X21 X9 | X39 X5 | X8 X4 | X40 X42
X30 → 'General'
X31 → X6 X32 | X22 X17 | X12 X15 | X20 X1 | X45 X29 | X43 X41
X36 → X30 X37 | X21 X22 | X10 X27
X37 → X3 X22 | X18 X27
X34 → X20 X19 | X45 X13
X35 → X2 X38 | X19 X42
X38 → X13 X42
X39 → X30 X11 | X21 X20
X18 → X3 X6
X19 → X2 X13
X10 → X30 X18 | X21 X6
X11 → X3 X20
X12 → X6 X28 | X22 X47 | X20 X35 | X45 X38 | X43 X42
X13 → 'there' | 'gone'
X14 → X3 X12 | X18 X28 | X37 X47 | X11 X35 | X23 X38 | X46 X42
X15 → 'morning' | 'evening'
X16 → X42 X15
X17 → X47 X15

source: Gaelle Garet

SLIDE 23

Limitations

  • A lot of non-terminals and rules
  • A lot of redundancy and ambiguity

⇒ Parsing time and learning time problems (about a day for one experiment on a simple set of proteins), and an illegible grammar.

A solution

Reduce the grammar before parsing, or better, during the inference.

SLIDE 26

Ideas

  1. Avoid unnecessary derivations by removing non-terminals with a deterministic derivation: if A is the left-hand side of only one rule A → α, then delete A together with that rule and replace all B → ...A... by B → ...α...
  2. Reduce the right-hand sides of the production rules: for each N → ...β..., if there exists α such that |α| < |β| ∧ α ⇒* β, then replace N → ...β... by N → ...α...

(Teaser: it also ensures maximal generalization!)

SLIDE 27

1. Avoid unnecessary derivations

Keep prime classes

Recall: each non-terminal corresponds to a substitutability congruence class. Slight abuse of notation: let [x] denote the class of a non-terminal or terminal x.

If N is deterministically derived by N → α = α1 ... α|α|, then [N] = [α1]...[α|α|]. We call such a useless class a composite class (these are the classes giving rise to vacuous local derivation trees [Clark, 2011]). We say that a class is prime if it is not composite. We have:

Primality test

Let L be a language whose set of non-zero and non-unit congruence classes is C+.

A class [y] in C+ is prime for L iff ∀y1y2 ∈ [y], [y] ⊈ [y1][y2]. This test is sufficient since, for the syntactic congruence, we have [y1y2] ⊇ [y1][y2].

SLIDE 28

K-Primes

Due to the monotonicity of generalization, it is safe to filter out composite classes on the basis of the sample K. We introduce the function Primes() for that purpose:

Primes()

Input: Set of substitutability classes: CK
Output: Set of substitutability classes satisfying the primality test in CK: P

P ← ∅
for C ∈ CK do
    if (∄C′ ∈ CK : ∀y ∈ C, ∃y′ ∈ C′, ∃v ∈ Σ+, y = y′v)
       and (∄C′ ∈ CK : ∀y ∈ C, ∃y′ ∈ C′, ∃u ∈ Σ+, y = uy′) then
        P ← P ∪ {C}
return P

The primality test is performed in K, not in L! But it works well in practice and when the sample is informative enough.
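Read literally, the test keeps a class unless some class covers a proper prefix of each of its members, or some class covers a proper suffix of each of its members. The Python sketch below follows that literal reading, which is itself an assumption about the simplified slide version; the function name `k_primes` and the representation of classes as sets of strings are chosen only for the example.

```python
def k_primes(classes):
    """Filter a list of substring classes with the sample-based primality test.

    Assumption: a class is discarded when one class C' contains, for every
    member y, a proper non-empty prefix of y (or, symmetrically, a proper
    non-empty suffix of y).
    """
    def prefixes(y):
        return [y[:i] for i in range(1, len(y))]

    def suffixes(y):
        return [y[i:] for i in range(1, len(y))]

    def covered(c, pieces):
        # True if a single class holds some piece of every member of c.
        return any(all(any(p in other for p in pieces(y)) for y in c)
                   for other in classes)

    return [c for c in classes
            if not covered(c, prefixes) and not covered(c, suffixes)]

# Toy usage: "ab" always starts with a member of the class {"a"}, so it is dropped.
print(k_primes([{"a"}, {"b"}, {"ab"}]))
```

Single-symbol classes have no proper prefix or suffix, so they always pass this filter.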

SLIDE 30

On the example

’be’ ’Major’ ’was’ ’General Major was here yesterday morning’ ’General Major’ ’will be there tomorrow’ ’there’ ’General’ ’will’ ’General Major went here yesterday morning’ ’He will be gone tomorrow evening’ ’here’ ’gone tomorrow evening’ ’went’ ’morning’ ’evening’ ’gone’ ’He will be gone’ ’there tomorrow morning’ ’General Major will be there tomorrow morning’ ’General Major will be there’ ’tomorrow’ ’was here yesterday’ ’went here yesterday’ ’He’ ’yesterday’

SLIDE 31

On the example

Classes kept, with their non-terminals:

  • X1: 'was', 'went'
  • X2: 'morning', 'evening'
  • X3: 'He', 'General Major'
  • X4: 'will be there tomorrow', 'was here yesterday', 'went here yesterday'
  • X5: 'there', 'gone'
  • S: 'General Major was here yesterday morning', 'General Major went here yesterday morning', 'General Major will be there tomorrow morning', 'He will be gone tomorrow evening'

SLIDE 32

ReGLiS Part1 (Learning Reduced Grammar by k, l-Local Substitutability)

simplified!

Input: Set of sequences K on alphabet Σ, int k, int l
Output: Grammar G = ⟨Σ, NK, SK, PK⟩

/* Prime substitutability classes on K */
CK ← Primes(LS_classes(K, k, l))

/* Build initial grammar */
NK ← ∅, PK ← ∅
for C ∈ CK do
    /* A non-terminal for each K-prime */
    NK ← NK ∪ {C}
    /* A direct production for each substring in the class */
    for y ∈ C do
        PK ← PK ∪ {C → y}
/* So far, we have built the initial grammar */

SLIDE 34

Initial ’bottom’ grammar

G = ⟨Σ, Vk, P, S⟩

Σ = { General, Major, will, be, gone, there, was, went, He, here, tomorrow, yesterday, morning, evening }
Vk = { S, X1, X2, X3, X4, X5 }
P = {
    S → General Major will be there tomorrow morning | General Major was here yesterday morning | General Major went here yesterday morning | He will be gone tomorrow evening
    X1 → was | went
    X2 → morning | evening
    X3 → He | General Major
    X4 → will be there tomorrow | was here yesterday | went here yesterday
    X5 → there | gone
}

  • A non-terminal for each K-prime
  • L(G) = K (NO language generalization)
  • For each non-terminal N, L(G, N) = [N]

SLIDE 35

2. Reduce right-hand sides

And generalize at once

  • We 'know' the interesting substitutability classes (but only part of their content).
  • If a substring from a substitutability class is present in a right-hand side, we have to replace this occurrence by the non-terminal of the class (Weak-implies-Strong generalization):

$$\frac{N_1 \to \ldots \beta \ldots, \quad N_2 \Rightarrow^* \beta}{N_1 \to \ldots N_2 \ldots, \quad N_2 \Rightarrow^* \beta}$$

  • Take care of overlapping occurrences.
  • Don't keep or introduce redundant rules; introduce the most general only.
  • Some kinship with Minimal Grammar Parsing for the smallest grammar problem [Carrascosa et al., 2011; Gallé, 2011], but with more than one substring per non-terminal.

SLIDE 36

Example

Suppose that, so far, the grammar is such that

P = { ... ; N → abcde | ... ; N1 → bcd | ... ; N2 → cde | ... ; N3 → ab | ... ; N4 → de | ... ; ... }

(where each of a, b, c, d, e is a terminal or non-terminal symbol).

The parsing graph for N → abcde is then:

    edge     label
    0 → 1    a
    1 → 2    b
    2 → 3    c
    3 → 4    d
    4 → 5    e
    1 → 4    N1
    2 → 5    N2
    0 → 2    N3
    3 → 5    N4

The path 0-1-4-5 gives N → aN1e. The path 0-2-3-5 can be reduced (025 is a strict subsequence of 0235), which gives N3N2. Hence:

N → aN1e | N3N2

Note: typo in the final version of the paper! On p. 7, replace 'strict substring' by 'strict subsequence'.

Implemented by dynamic programming on vertices in the function Non_redundant_rhs().

SLIDE 41

ReGLiS Part2

simplified!

/* Generalization */
repeat
    P ← PK ; PK ← ∅
    /* Branching rules */
    for (C → α) ∈ P ordered by increasing |α| do
        PG ← Build_parsing_graph(α, P)
        for β ∈ Non_redundant_rhs(PG) do
            PK ← PK ∪ {C → β}
until PK = P
SK ← {CK(k) : k ∈ K}
return ⟨Σ, NK, SK, PK⟩

Build_parsing_graph

Input: Sequence α, Set of rules P
Output: Parsing graph ⟨V, E⟩

V ← {i ∈ [0, |α|]} /* vertices */ ; E ← ∅ /* labeled directed edges */
for i ∈ V do
    for j ∈ V : i < j and (i, j) ≠ (0, |α|) do
        if ∃(C → α[i + 1, j]) ∈ P then
            E ← E ∪ {(i, j, C)}
return ⟨V, E⟩
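The generalization loop can be sketched in Python as below. This is an illustration under assumptions, not the authors' code: rules are stored in a dictionary from non-terminal to a set of right-hand sides (tuples of symbols); a position not covered by any rule keeps its own symbol as edge label, which the simplified pseudocode leaves implicit; and the non-redundant right-hand sides are found by brute-force path enumeration instead of the dynamic programming mentioned in the talk. All function names are hypothetical.

```python
from itertools import product

def build_parsing_graph(alpha, rules):
    """Edges (i, j) -> labels for every rule C -> alpha[i:j], full span excluded."""
    n, edges = len(alpha), {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            if (i, j) != (0, n):
                labels = {c for c, rhss in rules.items() if alpha[i:j] in rhss}
                if labels:
                    edges[(i, j)] = labels
    for i in range(n):  # assumption: uncovered positions keep their own symbol
        edges.setdefault((i, i + 1), {alpha[i]})
    return n, edges

def non_redundant_rhs(n, edges):
    """Right-hand sides of paths 0..n whose vertex set contains no other path's."""
    paths = []

    def walk(path):
        if path[-1] == n:
            paths.append(tuple(path))
            return
        for (a, b) in edges:
            if a == path[-1]:
                walk(path + [b])

    walk([0])
    minimal = [p for p in paths if not any(set(q) < set(p) for q in paths)]
    result = set()
    for p in minimal:
        result.update(product(*(sorted(edges[e]) for e in zip(p, p[1:]))))
    return result

def generalize(rules):
    """Repeat the reduction of all right-hand sides until nothing changes."""
    current = {c: set(r) for c, r in rules.items()}
    while True:
        new = {c: set() for c in current}
        for c, rhss in current.items():
            for alpha in sorted(rhss, key=len):
                new[c] |= non_redundant_rhs(*build_parsing_graph(alpha, current))
        if new == current:
            return new
        current = new

def w(s):
    return tuple(s.split())

initial = {
    "S": {w("General Major will be there tomorrow morning"),
          w("General Major was here yesterday morning"),
          w("General Major went here yesterday morning"),
          w("He will be gone tomorrow evening")},
    "X1": {w("was"), w("went")},
    "X2": {w("morning"), w("evening")},
    "X3": {w("He"), w("General Major")},
    "X4": {w("will be there tomorrow"), w("was here yesterday"),
           w("went here yesterday")},
    "X5": {w("there"), w("gone")},
}
final = generalize(initial)
print(final["S"])   # {('X3', 'X4', 'X2')}
```

On this input the loop needs two rounds: 'He will be gone tomorrow evening' first becomes X3 will be X5 tomorrow X2, and only after X4 has acquired the rule will be X5 tomorrow can it be reduced to X3 X4 X2.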

SLIDE 43

Final Grammar

S → X3 X4 X2
X1 → was | went
X2 → morning | evening
X3 → He | General Major
X4 → will be X5 tomorrow | X1 here yesterday
X5 → there | gone

SLIDE 44

Complexity

Complexity: $O(\max(l^3, l \cdot t))$, where l is the size of the longest sequence and t is the size of the target grammar.

Run time comparison between the old (SGLLS) and new (ReGLiS) learning algorithm implementations:

[Plot: time (seconds) with respect to the number of strings in the training sample, for ReGLiS and SGLLS]

[Plot: time (seconds) with respect to the length of strings in the training sample, for ReGLiS and SGLLS]

For our protein experiments: from a day to a few minutes

SLIDE 45

Protein experiments (10-fold cross-validation)

                       Zinc finger                        MPI phos.
             Precision  Recall  F-measure     Precision  Recall  F-measure
  Subst.     1          0.1     0.36          1          0.15    0.26
  3,3-LS     1          0.2     0.33          1          0.5     0.67
  4,4-LS     1          0.25    0.4           1          0.6     0.75
  5,5-LS     1          0.33    0.5           1          0.67    0.8
  6,6-LS     1          0.5     0.67          1          0.62    0.77
  7,7-LS     1          0.55    0.7           1          0.53    0.69

  SCFG [Dyrka & Nebel, 2009], (Precision, Recall, F-measure) triples as listed:
  (1, 0.1, 0.18), (1, 0.3, 0.46), (0.15, 1, 0.26), (0.5, 1, 0.67), (0.75, 0.87, 0.85), (0.98, 0.89, 0.93)

                       PS00219                            PS00063
             Precision  Recall  F-measure     Precision  Recall  F-measure
  Subst.     1          0.2     0.33          1          0.23    0.37
  3,3-LS     1          0.72    0.84          1          0.58    0.73
  4,4-LS     1          0.7     0.82          1          0.6     0.75
  5,5-LS     1          0.68    0.8           1          0.66    0.8
  6,6-LS     1          0.6     0.75          1          0.7     0.82
  7,7-LS     1          0.5     0.67          1          0.65    0.79
  Prosite    1          0.6     0.75          1          0.8     0.89

  SCFG [Dyrka & Nebel, 2009], (Precision, Recall, F-measure) triples as listed:
  (1, 0.05, 0.1), (0.1, 1, 0.18), (1, 1, 1), (0.79, 0.65, 0.71)

Table: Sequence class prediction by grammars obtained for different families

SLIDE 46

Conclusion

ReGLiS (the ReG.*iS family: ReGiS, ReGCiS, ReGLiS, ReGLCiS):

  • Bottom-up generalization from initial grammar
  • Efficient by dynamic programming on parsing graph
  • No parsing required, iterative

Practical algorithm:

  • Faster inference
  • Faster parsing (non-redundant minimal grammar)

Reduced grammar:

  • Easier to read
  • Canonical form! cf. [Clark, 2013] → polynomial identification in the limit property (cf. Remi's talk yesterday)

Confirmation of good results on proteins (with some preprocessing but no statistical parameters).

Another step towards practical application of (local-)substitutability.

SLIDE 47

Perspectives

Practical

  • Choice of initial classes: data-driven heuristics
  • Grammar weighting for biological sequences
  • Better understand why (local-)substitutability seems pertinent for biological sequences...
  • Prototype to efficient implementation?

Theoretical

  • Better understand/characterize the interest of the outer loop in generalization wrt SGL
  • What happens exactly when the sample is not characteristic?
  • Is it possible to ensure always returning a grammar in the target class?

And attend Gaelle's thesis (Dec 2014)
